Abstract. A central problem in comparative genomics consists in computing a (dis-)similarity measure
between two genomes, e.g. in order to construct a phylogenetic tree. A large number of such measures
has been proposed in the recent past: number of reversals, number of breakpoints, number of common
or conserved intervals, SAD etc. In their initial definitions, all these measures suppose that genomes
contain no duplicates. However, we now know that genes can be duplicated within the same genome. One
possible approach to overcome this difficulty is to establish a one-to-one correspondence (i.e. a matching)
between genes of both genomes, where the correspondence is chosen in order to optimize the studied
measure. Then, after a gene relabeling according to this matching and a deletion of the unmatched
signed genes, two genomes without duplicates are obtained and the measure can be computed.

In this paper, we are interested in three measures (number of breakpoints, number of common intervals
and number of conserved intervals) and three models of matching (exemplar, intermediate and maximum
matching models). We prove that, for each model and each measure M, computing a matching between
two genomes that optimizes M is APX-hard. We show that this result remains true even for two
genomes $G_1$ and $G_2$ such that $G_1$ contains no duplicates and no gene of $G_2$ appears more than twice.
Therefore, our results extend those of [7, 10, 13]. Besides, in order to evaluate the possible existence of
approximation algorithms concerning the number of breakpoints, we also study the complexity of the
following decision problem: is there an exemplarization (resp. an intermediate matching, a maximum
matching) that induces no breakpoint? In particular, we extend a result of [13] by proving the problem
to be NP-complete in the exemplar model for a new class of instances; we note that the problems
are equivalent in the intermediate and the exemplar models, and we show that the problem is in P in
the maximum matching model. Finally, we focus on a fourth measure, closely related to the number
of breakpoints: the number of adjacencies, for which we give several constant ratio approximation
algorithms in the maximum matching model, in the case where genomes contain the same number of
duplications of each gene.
1 Introduction and Preliminaries



In comparative genomics, computing a measure of (dis-)similarity between two genomes is a central 
problem: such a measure can be used, for instance, to construct phylogenetic trees. The measures 
defined so far essentially fall into two categories: the first one consists in counting the minimum 
number of operations needed to transform a genome into another (e.g. the edit distance [21] or the 
number of reversals [4]). The second one contains (dis-)similarity measures based on the genome 
structure, such as the number of breakpoints [7], the conserved intervals distance [6], the number of 
common intervals [10], SAD and MAD [24] etc. 



When genomes contain no duplicates, most measures can be computed in polynomial time.
However, assuming that genomes contain no duplicates is too restrictive. Indeed, it has recently
been shown that a great number of duplicates exist in some genomes. For example, the authors
of [20] estimate that 15% of genes are duplicated in the human genome. A possible approach to overcome
this difficulty is to specify a one-to-one correspondence (i.e. a matching) between genes of both
genomes and to remove the unmatched genes, thus obtaining two genomes with identical gene
content and no duplicates. Usually, the above mentioned matching is chosen in order to optimize
the studied measure, following the parsimony principle. Three models achieving this correspondence
have been proposed: the exemplar model [23], the intermediate model [3] and the maximum
matching model [25]. Before defining precisely the measures and models studied in this paper, we
need to introduce some notation.

Notations used in the paper. A genome $G$ is represented by a sequence of signed integers (called
signed genes). For any genome $G$, we denote by $\mathcal{F}_G$ the set of unsigned integers (called genes) that
are present in $G$. For any signed gene $g$, let $-g$ be the signed gene having the opposite sign and let
$|g| \in \mathcal{F}_G$ be the corresponding (unsigned) gene.

Given a genome $G$ without duplicates and two signed genes $a$, $b$ such that $a$ is located before
$b$, let $G[a,b]$ be the set $S \subseteq \mathcal{F}_G$ of genes located between genes $a$ and $b$ in $G$, $a$ and $b$ included.
We also denote by $[a,b]_G$ the substring (i.e. the sequence of consecutive elements) of $G$ starting at $a$ and
finishing at $b$ in $G$.

Let $occ(g,G)$ be the number of occurrences of a given gene $g$ in a genome $G$ and let
$occ(G) = \max\{occ(g,G) \mid g \in \mathcal{F}_G\}$. A pair of genomes $(G_1,G_2)$ is said to be of type $(x,y)$ if $occ(G_1) = x$ and
$occ(G_2) = y$. A pair of genomes $(G_1,G_2)$ is said to be balanced if, for each gene $g \in \mathcal{F}_{G_1} \cup \mathcal{F}_{G_2}$, we
have $occ(g,G_1) = occ(g,G_2)$ (otherwise, $(G_1,G_2)$ will be said to be unbalanced). Note that a pair
$(G_1,G_2)$ of type $(x,x)$ is not necessarily balanced.

Denote by $n_G$ the size of genome $G$, that is the number of signed genes it contains. Let $G[p]$,
$1 \le p \le n_G$, be the signed gene that occurs at position $p$ on genome $G$, and let $|G[p]| \in \mathcal{F}_G$ be the
corresponding (unsigned) gene. Let $N_G[p]$, $1 \le p \le n_G$, be the number of occurrences of $|G[p]|$ in
the first $(p-1)$ positions of $G$.

We define a duo in a genome $G$ as a pair of successive signed genes. Given a duo
$d_i = (G[i], G[i+1])$ in a genome $G$, we denote by $-d_i$ the duo equal to $(-G[i+1], -G[i])$. Let $(d_1, d_2)$ be a pair of duos;
$(d_1, d_2)$ is called a duo match if $d_1$ is a duo of $G_1$, $d_2$ is a duo of $G_2$, and if either $d_1 = d_2$ or
$d_1 = -d_2$.

For example, consider the genome $G_1 = +1\ +2\ +3\ +4\ +5\ -1\ -2\ +6\ -2$. Then
$\mathcal{F}_{G_1} = \{1,2,3,4,5,6\}$, $n_{G_1} = 9$, $occ(1,G_1) = 2$, $occ(G_1) = 3$, $G_1[7] = -2$, $-G_1[7] = +2$, $|G_1[7]| = 2$
and $N_{G_1}[7] = 1$. Let $G_2$ be the genome $G_2 = +2\ -1\ +6\ +3\ -5\ -4\ +2\ -1\ -2$. Then the pair
$(G_1,G_2)$ is balanced and is of type $(3,3)$. Let $d_1 = (G_1[4], G_1[5])$ be the duo $(+4,+5)$ and $d_2$ be the duo
$(G_2[5], G_2[6])$. The pair $(d_1, d_2)$ is a duo match. Now, consider the genome $G_3 = +3\ -2\ +6\ +4\ -1\ +5$
without duplicates. We have $G_3[+6,-1] = \{1,4,6\}$ and $[+6,-1]_{G_3} = (+6,+4,-1)$.
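The notation above can be made concrete with a minimal Python sketch. It assumes a genome is stored as a list of non-zero signed integers; the function names (`occ`, `occ_max`, `n_prev`, `is_balanced`, `is_duo_match`) are hypothetical helpers chosen for illustration and do not come from the paper.

```python
from collections import Counter

def occ(g, genome):
    """occ(g, G): number of occurrences of the unsigned gene g in the genome."""
    return sum(1 for s in genome if abs(s) == g)

def occ_max(genome):
    """occ(G): largest number of occurrences of any single gene."""
    return max(Counter(abs(s) for s in genome).values())

def n_prev(genome, p):
    """N_G[p]: occurrences of |G[p]| in the first p-1 positions (p is 1-based)."""
    g = abs(genome[p - 1])
    return sum(1 for s in genome[:p - 1] if abs(s) == g)

def is_balanced(g1, g2):
    """A pair is balanced when every gene occurs equally often in both genomes."""
    return Counter(abs(s) for s in g1) == Counter(abs(s) for s in g2)

def is_duo_match(d1, d2):
    """(d1, d2) is a duo match if d1 = d2 or d1 = -d2, with -(a, b) = (-b, -a)."""
    return d1 == d2 or d1 == (-d2[1], -d2[0])

# The genomes G_1 and G_2 of the example in the text.
G1 = [1, 2, 3, 4, 5, -1, -2, 6, -2]
G2 = [2, -1, 6, 3, -5, -4, 2, -1, -2]
print(occ(1, G1), occ_max(G1), n_prev(G1, 7), is_balanced(G1, G2))
```

Positions are 1-based in the text and 0-based in Python lists, which is why `n_prev` shifts its index.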

Breakpoints, adjacencies, common and conserved intervals. Let us now define the four measures
we will study in this paper. Let $G_1$ and $G_2$ be two genomes without duplicates and with the same
gene content, that is $\mathcal{F}_{G_1} = \mathcal{F}_{G_2}$.

Breakpoint and Adjacency. Let $(a,b)$ be a duo in $G_1$. We say that the duo $(a,b)$ induces a
breakpoint of $(G_1,G_2)$ if neither $(a,b)$ nor $(-b,-a)$ is a duo in $G_2$. Otherwise, we say that $(a,b)$
induces an adjacency of $(G_1,G_2)$. For example, when $G_1 = +1\ +2\ +3\ +4\ +5$ and
$G_2 = +5\ -4\ -3\ +2\ +1$, the duo $(2,3)$ in $G_1$ induces a breakpoint of $(G_1,G_2)$ while $(3,4)$ in $G_1$ induces an
adjacency of $(G_1,G_2)$. We denote by $B(G_1,G_2)$ (resp. $A(G_1,G_2)$) the number of breakpoints (resp. the
number of adjacencies) that exist between $G_1$ and $G_2$.
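As an illustration of this definition, here is a small Python sketch that counts breakpoints and adjacencies for two duplicate-free genomes with the same gene content. The list-of-signed-integers encoding and the function name `breakpoints_and_adjacencies` are assumptions made for the example.

```python
def breakpoints_and_adjacencies(g1, g2):
    """Return (B, A): the numbers of breakpoints and adjacencies of (g1, g2).

    A duo (a, b) of g1 induces an adjacency when (a, b) or (-b, -a) is a duo
    of g2, and a breakpoint otherwise.
    """
    duos2 = set(zip(g2, g2[1:]))
    adjacencies = sum(
        1 for a, b in zip(g1, g1[1:])
        if (a, b) in duos2 or (-b, -a) in duos2
    )
    breakpoints = (len(g1) - 1) - adjacencies
    return breakpoints, adjacencies

# Example from the text: (2, 3) is a breakpoint, (3, 4) is an adjacency.
print(breakpoints_and_adjacencies([1, 2, 3, 4, 5], [5, -4, -3, 2, 1]))
```

Every duo of the first genome is either a breakpoint or an adjacency, so the two counts always sum to the genome length minus one.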

Common interval. A common interval of $(G_1,G_2)$ is a substring of $G_1$ such that $G_2$ contains a
permutation of this substring (not taking signs into account). For example, consider
$G_1 = +1\ +2\ +3\ +4\ +5$ and $G_2 = +2\ -4\ +3\ +5\ +1$. The substring $[+3,+5]_{G_1}$ is a common interval of $(G_1,G_2)$.
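A straightforward quadratic enumeration makes the definition concrete. The sketch below is a naive illustration, not an algorithm from the paper; it assumes duplicate-free genomes over the same genes, encoded as lists of signed integers, and the name `common_intervals` is hypothetical.

```python
def common_intervals(g1, g2):
    """List the substrings of g1 whose genes occupy consecutive positions in g2.

    Signs are ignored, as in the definition. Singletons and the whole genome
    are included in the result.
    """
    pos2 = {abs(s): i for i, s in enumerate(g2)}
    found = []
    for i in range(len(g1)):
        lo = hi = pos2[abs(g1[i])]
        for j in range(i, len(g1)):
            p = pos2[abs(g1[j])]
            lo, hi = min(lo, p), max(hi, p)
            # The j - i + 1 genes are contiguous in g2 exactly when their
            # positions span a window of the same width.
            if hi - lo == j - i:
                found.append(tuple(g1[i:j + 1]))
    return found

intervals = common_intervals([1, 2, 3, 4, 5], [2, -4, 3, 5, 1])
print((3, 4, 5) in intervals, len(intervals))
```

On the example of the text, the enumeration reports the five singletons, the whole genome, and the substrings (2,3,4), (2,3,4,5), (3,4) and (3,4,5).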

Conserved interval. Consider two signed genes $a$ and $b$ of $G_1$ such that $a$ precedes $b$, where the
precedence relation is taken in the broad sense, so that possibly $a = b$. The substring $[a,b]_{G_1}$ is a conserved
interval of $(G_1,G_2)$ if either (i) $a$ precedes $b$ and $G_2[a,b] = G_1[a,b]$, or (ii) $-b$ precedes $-a$ and
$G_2[-b,-a] = G_1[a,b]$. For example, if $G_1 = +1\ +2\ +3\ +4\ +5$ and $G_2 = -5\ -4\ +3\ -2\ +1$,
the substring $[+2,+5]_{G_1}$ is a conserved interval of $(G_1,G_2)$. We note that the notion of conserved
interval does not consider the sign of genes. Note also that a conserved interval is actually a common
interval, but with additional restrictions on its extremities.
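The two conditions (i) and (ii) can be checked directly. The following Python sketch is a naive illustration under the assumption that both genomes are duplicate-free lists of signed integers over the same genes; `conserved_intervals` is a hypothetical name, and each interval is reported by its pair of extremities.

```python
def conserved_intervals(g1, g2):
    """Return the pairs (a, b) such that [a, b] in g1 is a conserved interval."""
    pos2 = {s: i for i, s in enumerate(g2)}  # signed gene -> position in g2
    found = []
    for i in range(len(g1)):
        for j in range(i, len(g1)):
            a, b = g1[i], g1[j]
            if a in pos2 and b in pos2 and pos2[a] <= pos2[b]:
                lo, hi = pos2[a], pos2[b]        # case (i): a precedes b
            elif -a in pos2 and -b in pos2 and pos2[-b] <= pos2[-a]:
                lo, hi = pos2[-b], pos2[-a]      # case (ii): -b precedes -a
            else:
                continue
            # Same set of (unsigned) genes between the two extremities.
            if {abs(s) for s in g2[lo:hi + 1]} == {abs(s) for s in g1[i:j + 1]}:
                found.append((a, b))
    return found

conserved = conserved_intervals([1, 2, 3, 4, 5], [-5, -4, 3, -2, 1])
print((2, 5) in conserved, len(conserved))
```

In this sketch the signs of the two extremities are compared with the second genome, while the genes strictly inside the interval are compared as unsigned sets.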

Dealing with duplicates in genomes. When genomes contain duplicates, we cannot directly
compute the measures defined in the previous paragraph. A solution consists in finding a one-to-one
correspondence (i.e. a matching) between duplicated genes of $G_1$ and $G_2$; we then use this
correspondence to rename genes of $G_1$ and $G_2$, and we delete the unmatched signed genes in order to
obtain two genomes $G'_1$ and $G'_2$ such that $G'_2$ is a permutation of $G'_1$; thus, the measure
computation becomes possible. In this paper, we will focus on three models of matching: the exemplar,
intermediate and maximum matching models.

- The exemplar model [23]: for each gene $g$, we keep in the matching $\mathcal{M}$ only one occurrence of $g$
in $G_1$ and in $G_2$, and we remove all the other occurrences. Hence, we obtain two genomes $G_1^E$
and $G_2^E$ without duplicates. The triplet $(G_1^E, G_2^E, \mathcal{M})$ is called an exemplarization of $(G_1,G_2)$.
Note that in this model, $\mathcal{M}$ can be inferred from the exemplarized genomes $G_1^E$ and $G_2^E$. Thus,
in the rest of the paper, any exemplarization $(G_1^E, G_2^E, \mathcal{M})$ of $(G_1,G_2)$ will be only described
by the pair $(G_1^E, G_2^E)$.

- The intermediate model [3]: in this model, for each gene $g$, we keep in the matching $\mathcal{M}$ an
arbitrary number $k_g$ of occurrences, $1 \le k_g \le \min(occ(g,G_1), occ(g,G_2))$, in order to obtain genomes $G_1^I$ and
$G_2^I$. We call the triplet $(G_1^I, G_2^I, \mathcal{M})$ an intermediate matching of $(G_1,G_2)$.

- The maximum matching model [25]: in this case, we keep in the matching $\mathcal{M}$ the maximum
number of signed genes in both genomes. More precisely, we look for a one-to-one correspondence
between signed genes of $G_1$ and $G_2$ that matches, for each gene $g$, exactly $\min(occ(g,G_1),
occ(g,G_2))$ occurrences. After this operation, we delete each unmatched signed gene. The triplet
$(G_1^M, G_2^M, \mathcal{M})$ obtained by this operation is called a maximum matching of $(G_1,G_2)$.
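To illustrate the exemplar model, the sketch below enumerates every exemplarization of a small pair of genomes and keeps the one with the fewest breakpoints. It is an exponential brute force meant only for tiny inputs, not an algorithm from the paper; it assumes both genomes contain the same set of genes, and the names `exemplarizations`, `count_breakpoints` and `exemplar_breakpoint_distance` are hypothetical.

```python
from itertools import product

def exemplarizations(genome):
    """Yield every exemplar genome: one kept occurrence per gene, order preserved."""
    positions = {}
    for i, s in enumerate(genome):
        positions.setdefault(abs(s), []).append(i)
    for choice in product(*(positions[g] for g in sorted(positions))):
        keep = set(choice)
        yield [s for i, s in enumerate(genome) if i in keep]

def count_breakpoints(g1, g2):
    """Number of duos (a, b) of g1 such that neither (a, b) nor (-b, -a) is a duo of g2."""
    duos2 = set(zip(g2, g2[1:]))
    return sum(1 for a, b in zip(g1, g1[1:])
               if (a, b) not in duos2 and (-b, -a) not in duos2)

def exemplar_breakpoint_distance(g1, g2):
    """Minimum number of breakpoints over all exemplarizations of (g1, g2)."""
    return min(count_breakpoints(e1, e2)
               for e1 in exemplarizations(g1)
               for e2 in exemplarizations(g2))

print(exemplar_breakpoint_distance([1, 2, 3], [1, 3, 2, 3]))
```

In the usage line, keeping the last occurrence of gene 3 in the second genome yields the exemplar genome 1 2 3, which induces no breakpoint.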

Problems studied in this paper. Consider two genomes $G_1$ and $G_2$ with duplicates. Let EComI
(resp. IComI, MComI) be the problem which consists in finding an exemplarization (resp.
intermediate matching, maximum matching) $(G'_1, G'_2, \mathcal{M})$ of $(G_1,G_2)$ such that the number of common
intervals of $(G'_1, G'_2)$ is maximized. Moreover, let EConsI (resp. IConsI, MConsI) be the problem
which consists in finding an exemplarization (resp. intermediate matching, maximum matching)
$(G'_1, G'_2, \mathcal{M})$ of $(G_1,G_2)$ such that the number of conserved intervals of $(G'_1, G'_2)$ is maximized.
In Section 2, we prove the APX-hardness of EComI and EConsI, even for genomes $G_1$ and
$G_2$ such that $occ(G_1) = 1$ and $occ(G_2) = 2$. These results induce the APX-hardness under the
other models (i.e., IComI, MComI, IConsI and MConsI are APX-hard). These results extend
in particular those of [7, 10].

Let EBD (resp. IBD, MBD) be the problem which consists in finding an exemplarization (resp.
intermediate matching, maximum matching) $(G'_1, G'_2, \mathcal{M})$ of $(G_1,G_2)$ that minimizes the number
of breakpoints between $G'_1$ and $G'_2$. In Section 3, we prove the APX-hardness of EBD, even for
genomes $G_1$ and $G_2$ such that $occ(G_1) = 1$ and $occ(G_2) = 2$. This result implies that IBD and
MBD are also APX-hard, and extends those of [13].

Let ZEBD (resp. ZIBD, ZMBD) be the problem which consists in determining, for two genomes
$G_1$ and $G_2$, whether there exists an exemplarization (resp. intermediate matching, maximum
matching) which induces no breakpoint. In Section 4, we study the complexity of ZEBD, ZMBD and
ZIBD: in particular, we extend a result of [13] by proving ZEBD to be NP-complete for a new
class of instances. We also note that the problems ZEBD and ZIBD are equivalent, and we show
that ZMBD is in P.

Finally, in Section 5, we focus on a fourth measure, closely related to the number of breakpoints: 
the number of adjacencies, for which we give several constant ratio approximation algorithms in 
the maximum matching model, in the case where genomes are balanced. 

2 EComI and EConsI are APX-hard

Consider two genomes $G_1$ and $G_2$ with duplicates, and let EComI (resp. IComI, MComI) be
the problem which consists in finding an exemplarization (resp. intermediate matching, maximum
matching) $(G'_1, G'_2, \mathcal{M})$ of $(G_1,G_2)$ such that the number of common intervals of $(G'_1, G'_2)$ is
maximized. Moreover, let EConsI (resp. IConsI, MConsI) be the problem which consists in finding an
exemplarization (resp. intermediate matching, maximum matching) $(G'_1, G'_2, \mathcal{M})$ of $(G_1,G_2)$ such
that the number of conserved intervals of $(G'_1, G'_2)$ is maximized.

EComI and MComI have been proved to be NP-complete even if $occ(G_1) = 1$ and $occ(G_2) = 2$
in [10]. Besides, in [6], Blin and Rizzi have studied the problem of computing a distance built on
the number of conserved intervals. This distance differs from the number of conserved intervals
we study in this paper, mainly in the sense that (i) it can be applied to two sets of genomes
(as opposed to two genomes in our case), and (ii) the distance between two identical genomes of
length $n$ is equal to $0$ (as opposed to $\frac{n(n+1)}{2}$ in our case). Blin and Rizzi [6] proved that finding
the minimum distance is NP-complete, under both the exemplar and maximum matching models.
A closer analysis of their proof shows that it can be easily adapted to prove that EConsI and
MConsI are NP-complete, even in the case $occ(G_1) = 1$.

We can conclude from the above results that IComI and IConsI are also NP-complete, since 
when one genome contains no duplicates, exemplar, intermediate and maximum matching models 
are equivalent. 

In this section, we improve the above results by showing that the six problems EComI, IComI,
MComI, EConsI, IConsI and MConsI are APX-hard, even when genomes $G_1$ and $G_2$ are such
that $occ(G_1) = 1$ and $occ(G_2) = 2$. The main result is Theorem 1, which will be completed by
Corollary 1 at the end of the section.

Theorem 1. EComI and EConsI are APX-hard even when genomes $G_1$ and $G_2$ are such that
$occ(G_1) = 1$ and $occ(G_2) = 2$.

We prove Theorem 1 by using an L-reduction [22] from the Min-Vertex-Cover problem on
cubic graphs, denoted here Min-Vertex-Cover-3. Let $G = (V,E)$ be a cubic graph, i.e. for all
$v \in V$, $degree(v) = 3$. A set of vertices $V' \subseteq V$ is called a vertex cover of $G$ if for each edge $e \in E$,
there exists a vertex $v \in V'$ such that $e$ is incident to $v$. The problem Min-Vertex-Cover-3 is
defined as follows:

Problem: Min-Vertex-Cover-3
Input: A cubic graph $G = (V,E)$.
Solution: A vertex cover $V'$ of $G$.
Measure: The cardinality of $V'$.

Min-Vertex-Cover-3 was proved to be APX-complete in [1]. 
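For intuition about the source problem, the following Python sketch checks the vertex cover property and finds a minimum vertex cover by exhaustive search. It is a brute force for very small graphs only; the function names are hypothetical, and the complete graph on four vertices is used as an assumed example because it is the smallest cubic graph.

```python
from itertools import combinations

def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in the set cover."""
    return all(u in cover or v in cover for u, v in edges)

def min_vertex_cover(vertices, edges):
    """Smallest vertex cover, found by trying subsets in order of increasing size."""
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            if is_vertex_cover(edges, set(subset)):
                return set(subset)

# K4: every vertex has degree 3, so the graph is cubic.
k4_vertices = [1, 2, 3, 4]
k4_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(min_vertex_cover(k4_vertices, k4_edges)))
```

Any two vertices of the complete graph leave the edge between the two remaining vertices uncovered, so three vertices are needed.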
2.1 Reduction 

Let $G = (V,E)$ be an instance of Min-Vertex-Cover-3, where $G$ is a cubic graph with
$V = \{v_1, \ldots, v_n\}$ and $E = \{e_1, \ldots, e_m\}$. Consider the transformation $R$ which associates to the graph $G$
two genomes $G_1$ and $G_2$ in the following way, where each gene has a positive sign.

$G_1 = b_1\ b_2 \ldots b_m\ x\ a_1 C_1 f_1\ a_2 C_2 f_2 \ldots a_n C_n f_n\ y\ b_{m+n}\ b_{m+n-1} \ldots b_{m+1}$   (1)

$G_2 = y\ a_1 D_1 f_1\ b_{m+1}\ a_2 D_2 f_2\ b_{m+2} \ldots a_n D_n f_n\ b_{m+n}\ x$   (2)

with:

- for each $i$, $1 \le i \le n$, $a_i = 6i - 5$, $f_i = 6i$

- for each $i$, $1 \le i \le n$, $C_i = (a_i + 1), (a_i + 2), (a_i + 3), (a_i + 4)$

- for each $i$, $1 \le i \le n + m$, $b_i = 6n + i$

- $x = 7n + m + 1$ and $y = 7n + m + 2$

- for each $i$, $1 \le i \le n$, $D_i = (a_i + 3), (b_{j_i}), (a_i + 1), (b_{k_i}), (a_i + 4), (b_{l_i}), (a_i + 2)$, where $e_{j_i}$, $e_{k_i}$ and
$e_{l_i}$ are the edges which are incident to $v_i$ in $G$, with $j_i < k_i < l_i$.

In the following, genes $b_i$, $1 \le i \le m$, are called markers. There is no duplicated gene in $G_1$ and
the markers are the only duplicated genes in $G_2$; these genes occur twice in $G_2$. Hence, we have
$occ(G_1) = 1$ and $occ(G_2) = 2$.
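The construction can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the layout used for $G_2$ (gene $y$ first, then the blocks $a_i D_i f_i\, b_{m+i}$ for $i = 1, \ldots, n$, then gene $x$) is inferred from the way the proofs of this section use $G_2$; the function name `build_genomes` and the edge-list input format are hypothetical; and the complete graph on four vertices is used as an assumed test instance.

```python
from collections import Counter

def build_genomes(n, edges):
    """Build (G1, G2) from a cubic graph with vertices 1..n.

    edges is a list of m vertex pairs; edge e_j is edges[j - 1].
    All genes are positive, so genomes are plain lists of integers.
    """
    m = len(edges)
    x, y = 7 * n + m + 1, 7 * n + m + 2

    def b(i):
        return 6 * n + i

    g1 = [b(i) for i in range(1, m + 1)] + [x]
    for i in range(1, n + 1):
        # a_i C_i f_i is the run of consecutive integers 6i-5, ..., 6i.
        g1 += list(range(6 * i - 5, 6 * i + 1))
    g1 += [y] + [b(i) for i in range(m + n, m, -1)]

    g2 = [y]
    for i in range(1, n + 1):
        a = 6 * i - 5
        # Indices j < k < l of the three edges incident to vertex i.
        j, k, l = sorted(idx for idx, e in enumerate(edges, 1) if i in e)
        g2 += [a, a + 3, b(j), a + 1, b(k), a + 4, b(l), a + 2, a + 5, b(m + i)]
    g2.append(x)
    return g1, g2

k4_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
g1, g2 = build_genomes(4, k4_edges)
print(max(Counter(g1).values()), max(Counter(g2).values()))
```

Each edge has two endpoints, so each marker $b_j$ with $1 \le j \le m$ is placed in exactly two blocks $D_i$, which is why the second genome has maximum occurrence 2 while the first has none duplicated.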




Fig. 1. The cubic graph G. 



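To illustrate the reduction, consider the cubic graph $G$ of Figure 1. From $G$, we construct the
following genomes $G_1$ and $G_2$ (here $n = 4$ and $m = 6$):

$G_1 = b_1\ b_2\ b_3\ b_4\ b_5\ b_6\ x\ a_1 C_1 f_1\ a_2 C_2 f_2\ a_3 C_3 f_3\ a_4 C_4 f_4\ y\ b_{10}\ b_9\ b_8\ b_7$
$\phantom{G_1} = 25\ 26\ 27\ 28\ 29\ 30\ 35\ 1\ 2\ 3 \ldots 24\ 36\ 34\ 33\ 32\ 31$

[$G_2$: not recoverable from the source text; it depends on the edge labelling of Figure 1.]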
2.2 Preliminary results



In order to prove Theorem 1, we first give four intermediate lemmas. In the following, a common 
interval for the EComI problem or a conserved interval for EConsI is called a robust interval. 
Besides, a trivial interval will denote either an interval of length one (i.e. a singleton), or the whole 
genome. 

Lemma 1. For any exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$, the non trivial robust intervals of
$(G_1, G_2^E)$ are necessarily contained in some sequence $a_i C_i f_i$ of $G_1$ ($1 \le i \le n$).

Proof. We start by proving the lemma for common intervals, and we will then extend it to conserved
intervals. First, we prove that, for any exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$, each common interval
$I$ such that $|I| \ge 2$ contains either both of $x$, $y$ or none of them; in the former case, $I$ covers
the whole genome. Suppose there exists a common interval $I_x$ (recall that by definition $I_x$ is on
$G_1$) such that $|I_x| \ge 2$ and $I_x$ contains $x$. Let $PI_x$ be the permutation of $I_x$ in $G_2^E$. The interval $I_x$
must contain either $b_m$ or $a_1$. Let us detail each of the two cases:

(a) If $I_x$ contains $b_m$, then $PI_x$ contains $b_m$ too. Notice that there is some $i$, $1 \le i \le n$, such that $b_m$
belongs to $D_i$ in $G_2^E$. Then $PI_x$ contains all genes between $D_i$ and $x$ in $G_2^E$. Thus $PI_x$ contains
$b_{m+n}$. Consequently, $I_x$ contains $b_{m+n}$ and it also contains $y$.

(b) If $I_x$ contains $a_1$, then $PI_x$ contains $a_1$ too. Then $PI_x$ contains all genes between $a_1$ and $x$.
Thus $PI_x$ contains $b_{m+n}$. Hence, $I_x$ contains $b_{m+n}$ and then it also contains $y$.

Now, suppose that $I_y$ is a common interval such that $|I_y| \ge 2$ and $I_y$ contains $y$. Let $PI_y$ be the
permutation of $I_y$ on $G_2^E$. The interval $I_y$ must contain either $b_{m+n}$ or $f_n$. Let us detail each of the
two cases:

(a) If $I_y$ contains $b_{m+n}$, then $PI_y$ contains $b_{m+n}$ too. Thus $PI_y$ contains all genes between $b_{m+n}$
and $y$. Hence $PI_y$ contains all the sequences $D_i$, $1 \le i \le n$. In particular, $PI_y$ contains all the
markers and consequently $I_y$ must contain $x$.

(b) If $I_y$ contains $f_n$, then $PI_y$ contains $f_n$ too. Then $PI_y$ contains all genes between $f_n$ and $y$.
In particular, $PI_y$ contains $b_{m+n-1}$ and then $I_y$ contains $b_{m+n-1}$ too. Hence, $I_y$ also contains
$b_{m+n}$, similarly to the previous case. Thus $I_y$ contains $x$.

We conclude that each non singleton common interval containing either $x$ or $y$ necessarily
contains both $x$ and $y$. Therefore, and by construction of $G_2$, there is only one such interval, that
is $G_1$ itself. Hence, any non trivial common interval is necessarily, in $G_1$, either strictly on the left
of $x$, or between $x$ and $y$, or strictly on the right of $y$. Let us analyze these different cases:

- Let $I$ be a non trivial common interval situated strictly on the left of $x$ in $G_1$. Thus $I$ is a sequence
of at least two consecutive markers. Since in any exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$, every
marker in $G_2^E$ has neighboring genes which are not markers, this contradicts the fact that $I$ is
a common interval.

- Let $I$ be a non trivial common interval situated strictly on the right of $y$ in $G_1$. Then $I$ is a
substring of $b_{m+n}, \ldots, b_{m+1}$ containing at least two genes. In any exemplarization $(G_1, G_2^E)$ of
$(G_1, G_2)$, for each pair $(b_{m+i}, b_{m+i+1})$ of $G_2^E$, with $1 \le i < n$, we have $a_{i+1} \in G_2^E[b_{m+i}, b_{m+i+1}]$.
This contradicts the fact that $I$ is strictly on the right of $y$ in $G_1$.

- Let $I$ be a non trivial common interval lying between $x$ and $y$ in $G_1$. For any exemplarization
$(G_1, G_2^E)$ of $(G_1, G_2)$, a common interval cannot contain, in $G_1$, both $f_i$ and $a_{i+1}$ for some $i$,
$1 \le i \le n - 1$ (since $b_{m+i}$ is situated between $f_i$ and $a_{i+1}$ in $G_2^E$ and on the right of $x$ in $G_1$).
Hence, a non trivial common interval of $(G_1, G_2^E)$ is included in some sequence $a_i C_i f_i$ in $G_1$,
$1 \le i \le n$.

This proves the lemma for common intervals. By definition, any conserved interval is necessarily
a common interval. So, a non trivial conserved interval of $(G_1, G_2^E)$ is included in some sequence
$a_i C_i f_i$ in $G_1$, $1 \le i \le n$. The lemma is proved. □

Lemma 2. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and $i \in [1 \ldots n]$. Let $\Delta_i$ be a substring
of $[a_i + 3, a_i + 2]_{G_2^E}$ that does not contain any marker. If $|\Delta_i| \in \{2, 3\}$, then there is no robust
interval $I$ of $(G_1, G_2^E)$ such that $\Delta_i$ is a permutation of $I$.

Proof. First, we prove that there is no permutation $I$ of $\Delta_i$ such that $I$ is a common interval of
$(G_1, G_2^E)$. Next, we show that there is no permutation $I$ of $\Delta_i$ such that $I$ is a conserved interval. By
Lemma 1, we know that a non trivial common interval of $(G_1, G_2^E)$ is a substring of some sequence
$a_i C_i f_i$, $1 \le i \le n$. This substring contains only consecutive integers. Therefore, if there exists a
permutation $I$ of $\Delta_i$ such that $I$ is a common interval of $(G_1, G_2^E)$, then $\Delta_i$ must be a permutation
of consecutive integers. If $|\Delta_i| = 2$, we have $\Delta_i = (p, q)$ where $p$ and $q$ are not consecutive integers,
and if $|\Delta_i| = 3$, then we have $\Delta_i = (a_i + 3, a_i + 1, a_i + 4)$ or $\Delta_i = (a_i + 1, a_i + 4, a_i + 2)$. In these
three cases, $\Delta_i$ is not a permutation of consecutive integers. Hence, there is no permutation $I$ of $\Delta_i$
such that $I$ is a common interval of $(G_1, G_2^E)$. Moreover, any conserved interval is also a common
interval. Thus, there is no permutation $I$ of $\Delta_i$ such that $I$ is a conserved interval of $(G_1, G_2^E)$. □

For more clarity, let us now introduce some notations. Given a graph $G = (V,E)$, let
$VC = \{v_{i_1}, v_{i_2}, \ldots, v_{i_k}\}$ be a vertex cover of $G$. Let $R(G) = (G_1, G_2)$ be the pair of genomes defined by
the construction described in (1) and (2). Now, let $F$ be the function which associates to $VC$, $G_1$
and $G_2$ an exemplarization $F(VC)$ of $(G_1, G_2)$ as follows. In $G_2$, all the markers are removed from
the sequences $D_i$ for all $i \ne i_1, i_2, \ldots, i_k$. Next, for each marker which is still present twice, one of
its occurrences is arbitrarily removed. Since in $G_2$ only markers are duplicated, we conclude that
$F(VC)$ is an exemplarization of $(G_1, G_2)$.

Given a cubic graph $G$ and genomes $G_1$ and $G_2$ obtained by the transformation $R(G)$, let us
define the function $S$ which associates to an exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$ the vertex cover
$VC$ of $G$ defined as follows: $VC = \{v_i \mid 1 \le i \le n \wedge \exists j \in \{1 \ldots m\},\ b_j \in G_2^E[a_i, f_i]\}$. In other words,
we keep in $VC$ the vertices $v_i$ of $G$ for which there exists some gene $b_j$ such that $b_j$ is in $G_2^E[a_i, f_i]$.
We now prove that $VC$ is a vertex cover. Consider an edge $e_p$ of $G$. By construction of $G_1$ and $G_2$,
there exists some $i$, $1 \le i \le n$, such that gene $b_p$ is located between $a_i$ and $f_i$ in $G_2^E$. The presence
of gene $b_p$ between $a_i$ and $f_i$ implies that vertex $v_i$ belongs to $VC$. We conclude that each edge is
incident to at least one vertex of $VC$.

Let $W$ be the function defined on {EConsI, EComI} by $W(pb) = 1$ if $pb$ = EConsI and
$W(pb) = 4$ if $pb$ = EComI. Let $opt_{pb}(A)$ be the optimum result of an instance $A$ for an optimization
problem $pb$, $pb \in$ {EComI, EConsI, Min-Vertex-Cover-3}.

We now define the function $T$ whose arguments are a problem $pb \in$ {EConsI, EComI} and a
cubic graph $G$. Let $R(G) = (G_1, G_2)$ as usual. Then $T(pb, G)$ is defined as the number of robust
trivial intervals of $(G_1, G_2^E)$ with respect to $pb$. Let $n$ and $m$ be respectively the number of vertices
and the number of edges of $G$. We have $T(\text{EConsI}, G) = 7n + m + 2$ and $T(\text{EComI}, G) = 7n + m + 3$.
Indeed, for EComI, there are $7n + m + 2$ singletons and we also need to consider the whole genome.

Lemma 3. Let $pb \in$ {EComI, EConsI}. Let $G$ be a cubic graph and $R(G) = (G_1, G_2)$. Let
$(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and let $i$, $1 \le i \le n$. Then only two cases can
occur with respect to $D_i$.

1. Either in $G_2^E$, all the markers from $D_i$ were removed, and in this case, there are exactly $W(pb)$
non trivial robust intervals involving $D_i$.

2. Or in $G_2^E$, at least one marker was kept in $D_i$, and in this case, there is no non trivial robust
interval involving $D_i$.

Proof. We first prove the lemma for the EComI problem and then we extend it to EConsI.
Lemma 1 implies that each non trivial common interval $I$ of $(G_1, G_2^E)$ is contained in some substring
of $a_i C_i f_i$, $1 \le i \le n$. So, the permutation of $I$ on $G_2^E$ is contained in a substring of $a_i D_i f_i$, $1 \le i \le n$.

Consider $i$, $1 \le i \le n$, and suppose that all the markers from $D_i$ are removed on $G_2^E$. Thus,
$a_i C_i f_i$, $C_i$, $a_i C_i$ and $C_i f_i$ are common intervals of $(G_1, G_2^E)$. Let us now show that there is no other
non trivial common interval involving $D_i$. Let $\Delta_i$ be a substring of $[a_i + 3, a_i + 2]_{G_2^E}$ such that
$|\Delta_i| \in \{2, 3\}$. By Lemma 2, we know that $\Delta_i$ is not a common interval. The remaining intervals are
$(a_i, a_i + 3)$, $(a_i, a_i + 3, a_i + 1)$, $(a_i, a_i + 3, a_i + 1, a_i + 4)$, $(a_i + 1, a_i + 4, a_i + 2, f_i)$, $(a_i + 4, a_i + 2, f_i)$
and $(a_i + 2, f_i)$. By construction, none of them can be a common interval, because none of them
is a permutation of consecutive integers. Hence, there are only four non trivial common intervals
involving $D_i$ in $G_2^E$. Among these four common intervals, only $a_i C_i f_i$ is a conserved interval too. In
the end, if all the markers are removed from $D_i$, there are exactly four non trivial common intervals
and one non trivial conserved interval involving $D_i$. So, given a problem $pb \in$ {EComI, EConsI},
there are exactly $W(pb)$ non trivial robust intervals involving $D_i$.

Now, suppose that at least one marker of $D_i$ is kept in $G_2^E$. Lemma 1 shows that each non
trivial common interval $I$ of $(G_1, G_2^E)$ is contained in some substring of $a_i C_i f_i$, $1 \le i \le n$. Since
no marker is present in a sequence $a_i C_i f_i$, we deduce that there does not exist any non trivial common
interval containing a marker. So, a non trivial common interval involving $D_i$ only must contain a
substring $\Delta_i$ of $[a_i + 3, a_i + 2]_{G_2^E}$ such that $\Delta_i$ contains no marker. Since no marker is an extremity
of $[a_i + 3, a_i + 2]_{G_2^E}$, we have $|\Delta_i| \le 3$. By Lemma 2, we know that $\Delta_i$ is not a common interval.
The remaining intervals to be considered are the intervals $a_i \Delta_i$ and $\Delta_i f_i$. By construction of $a_i C_i f_i$,
these intervals are not common intervals (the absence of gene $a_i + 2$ for $a_i \Delta_i$ and of gene $a_i + 3$
for $\Delta_i f_i$ implies that these intervals are not a permutation of consecutive integers). Hence, these
intervals cannot be conserved intervals either. □

Lemma 4. Let $pb \in$ {EComI, EConsI}. Let $G = (V,E)$ be a cubic graph with $V = \{v_1, \ldots, v_n\}$ and
$E = \{e_1, \ldots, e_m\}$ and let $G_1$, $G_2$ be the two genomes obtained by $R(G)$.

1. Let $VC$ be a vertex cover of $G$ and denote $k = |VC|$. Then the exemplarization $F(VC)$ of
$(G_1, G_2)$ has at least $N = W(pb) \cdot n + T(pb, G) - W(pb) \cdot k$ robust intervals.

2. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and let $VC'$ be the vertex cover of $G$ obtained
by $S(G_1, G_2^E)$. Then $|VC'| = \frac{W(pb) \cdot n + T(pb, G) - N}{W(pb)}$, where $N$ is the number of robust intervals of
$(G_1, G_2^E)$.

Proof. 1. Let $pb \in$ {EComI, EConsI}. Let $G$ be a cubic graph and let $G_1$ and $G_2$ be the two
genomes obtained by $R(G)$. Let $VC$ be a vertex cover of $G$ and denote $k = |VC|$. Let $(G_1, G_2^E)$ be the
exemplarization of $(G_1, G_2)$ obtained by $F(VC)$. By construction, we have at least $(n - k)$ substrings
$D_i$ in $G_2^E$ for which all the markers are removed. By Lemma 3, we know that each of these substrings
implies the existence of $W(pb)$ non trivial robust intervals. So, we have at least $W(pb) \cdot (n - k)$ non
trivial robust intervals. Moreover, by definition of $T(pb, G)$, the number of trivial robust intervals
of $(G_1, G_2^E)$ is exactly $T(pb, G)$. Thus, we have at least $N = W(pb) \cdot n + T(pb, G) - W(pb) \cdot k$
robust intervals of $(G_1, G_2^E)$.

2. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and let $n - j$ be the number of sequences $D_i$,
$1 \le i \le n$, for which all markers have been deleted in $G_2^E$. Then, by Lemmas 1 and 3, the number
of robust intervals of $(G_1, G_2^E)$ is equal to $N = W(pb) \cdot n + T(pb, G) - W(pb) \cdot j$. Let $VC'$ be the
vertex cover obtained by $S(G_1, G_2^E)$. Each marker has one occurrence in $G_2^E$ and these occurrences
lie in $j$ sequences $D_i$. So, by definition of $S$, we conclude that $|VC'| = j = \frac{W(pb) \cdot n + T(pb, G) - N}{W(pb)}$. □
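As a worked instance of the bound in the first item of Lemma 4, take for $G$ the complete graph on four vertices. This choice is an assumption made for illustration: it is the smallest cubic graph, with $n = 4$, $m = 6$, and a minimum vertex cover of size $k = 3$. Substituting the values of $W$ and $T$ given above:

```latex
\begin{align*}
N &= W(pb)\cdot n + T(pb,G) - W(pb)\cdot k \\
\text{EComI:}\quad  N &= 4\cdot 4 + (7\cdot 4 + 6 + 3) - 4\cdot 3 = 16 + 37 - 12 = 41 \\
\text{EConsI:}\quad N &= 1\cdot 4 + (7\cdot 4 + 6 + 2) - 1\cdot 3 = 4 + 36 - 3 = 37
\end{align*}
```

In both cases the count decomposes as the trivial robust intervals plus $W(pb)$ non trivial ones for the single sequence $D_i$ whose vertex lies outside the cover.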



2.3 Main result 

Let us first define the notion of L-reduction [22]: let $A$ and $B$ be two optimization problems and
$c_A$, $c_B$ be respectively their cost functions. An L-reduction from problem $A$ to problem $B$ is a pair
of polynomial-time computable functions $R$ and $S$ with the following properties:

(a) If $x$ is an instance of $A$, then $R(x)$ is an instance of $B$;

(b) If $x$ is an instance of $A$ and $y$ is a solution of $R(x)$, then $S(y)$ is a solution of $A$;

(c) If $x$ is an instance of $A$ and $R(x)$ is its corresponding instance of $B$, then there is some positive
constant $\alpha$ such that $opt_B(R(x)) \le \alpha \cdot opt_A(x)$;

(d) If $s$ is a solution of $R(x)$, then there is some positive constant $\beta$ such that
$|opt_A(x) - c_A(S(s))| \le \beta\, |opt_B(R(x)) - c_B(s)|$.

We prove Theorem 1 by showing that the pair $(R, S)$ defined previously is an L-reduction from
Min-Vertex-Cover-3 to EConsI and from Min-Vertex-Cover-3 to EComI. First note that
properties (a) and (b) are obviously satisfied by $R$ and $S$.

Consider $pb \in$ {EComI, EConsI}. Let $G = (V,E)$ be a cubic graph with $n$ vertices and $m$
edges. We now prove properties (c) and (d). Consider the genomes $G_1$ and $G_2$ obtained by $R(G)$.
For sake of clarity, we abbreviate here and in the following $opt_{\text{Min-Vertex-Cover-3}}$ to $opt_{\text{Min-VC}}$. First,
we need to prove that there exists $\alpha > 0$ such that $opt_{pb}(G_1, G_2) \le \alpha \cdot opt_{\text{Min-VC}}(G)$.

Since $G$ is cubic, we have the following properties:

$n \ge 4$   (3)

$m = \frac{1}{2} \sum_{i=1}^{n} degree(v_i) = \frac{3n}{2}$   (4)

$opt_{\text{Min-VC}}(G) \ge \frac{m}{3} = \frac{n}{2}$   (5)

To explain property (5), remark that, in a cubic graph $G$ with $n$ vertices and $m$ edges, each
vertex covers three edges. Thus, a set of $k$ vertices covers at most $3k$ edges. Hence, any vertex cover
of $G$ must contain at least $\frac{m}{3}$ vertices.

By Lemma 3, we know that sequences of the form $a_i C_i f_i$, $1 \le i \le n$, contain either zero or
$W(pb)$ non trivial robust intervals. By Lemma 1, there are no other non trivial robust intervals.
So, we have the following inequality, where $T(pb, G)$ counts the trivial robust intervals:

$opt_{pb}(G_1, G_2) \le T(pb, G) + W(pb) \cdot n$

If $pb$ = EComI, we have:

$opt_{\text{EComI}}(G_1, G_2) \le 7n + m + 3 + 4n$

$opt_{\text{EComI}}(G_1, G_2) \le \frac{27n}{2}$ by (3) and (4)   (6)

And if $pb$ = EConsI, we have:

$opt_{\text{EConsI}}(G_1, G_2) \le 7n + m + 2 + n$

$opt_{\text{EConsI}}(G_1, G_2) \le \frac{21n}{2}$ by (3) and (4)   (7)

Altogether, by (5), (6) and (7), we prove property (c) with $\alpha = 27$.
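The arithmetic behind (6), (7) and the value $\alpha = 27$ can be spelled out as follows. This is only an expanded restatement, using $m = \frac{3n}{2}$ from (4) and the fact that $n \ge 4$ from (3) gives $3 \le n$ and $2 \le n$:

```latex
\begin{align*}
opt_{\mathrm{EComI}}(G_1,G_2)
  &\le 7n + m + 3 + 4n = \tfrac{25n}{2} + 3 \le \tfrac{25n}{2} + n = \tfrac{27n}{2}, \\
opt_{\mathrm{EConsI}}(G_1,G_2)
  &\le 7n + m + 2 + n = \tfrac{19n}{2} + 2 \le \tfrac{19n}{2} + n = \tfrac{21n}{2}, \\
opt_{pb}(G_1,G_2)
  &\le \tfrac{27n}{2} = 27\cdot\tfrac{n}{2} \le 27\cdot opt_{\mathrm{Min\text{-}VC}}(G)
  \quad\text{by (5).}
\end{align*}
```

The same constant covers both problems because $\frac{21n}{2} \le \frac{27n}{2}$.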

Now, let us prove property (d). Let $VC = \{v_{i_1}, v_{i_2}, \ldots, v_{i_P}\}$ be a minimum vertex cover of $G$.
Then $P = opt_{\text{Min-VC}}(G)$. Let $G_1$ and $G_2$ be the genomes obtained by $R(G)$. Let $(G_1, G_2^E)$ be an
exemplarization of $(G_1, G_2)$ and let $k'$ be the number of robust intervals of $(G_1, G_2^E)$. Finally, let
$VC'$ be the vertex cover of $G$ such that $VC' = S(G_1, G_2^E)$. We need to find a positive constant $\beta$
such that $|P - |VC'|| \le \beta\, |opt_{pb}(G_1, G_2) - k'|$.

For $pb \in$ {EComI, EConsI}, let $N_{pb}$ be the number of robust intervals between the two genomes
obtained by $F(VC)$. By the first property of Lemma 4, we have

$opt_{pb}(G_1, G_2) \ge N_{pb} \ge W(pb) \cdot n + T(pb, G) - W(pb) \cdot P$

So, it is sufficient to prove that there exists some $\beta > 0$ such that
$|P - |VC'|| \le \beta\, |W(pb) \cdot n + T(pb, G) - W(pb) \cdot P - k'|$. By the second property of Lemma 4, we have

$|VC'| = \frac{W(pb) \cdot n + T(pb, G) - k'}{W(pb)}$

Since $P \le |VC'|$, we have

$|P - |VC'|| = |VC'| - P = \frac{W(pb) \cdot n + T(pb, G) - k'}{W(pb)} - P = \frac{1}{W(pb)} \left( W(pb) \cdot n + T(pb, G) - W(pb) \cdot P - k' \right)$

So $\beta = 1$ is sufficient in both cases, since $W(\text{EComI}) = 4$ and $W(\text{EConsI}) = 1$, which implies
$\frac{1}{W(pb)} \le 1$.

Altogether, we then have $|opt_{\text{Min-VC}}(G) - |VC'|| \le 1 \cdot |opt_{pb}(G_1, G_2) - k'|$.

We proved that the reduction $(R, S)$ is an L-reduction. This implies that for two genomes $G_1$
and $G_2$, both problems EConsI and EComI are APX-hard even if $occ(G_1) = 1$ and $occ(G_2) = 2$.
Theorem 1 is proved. □

We extend in Corollary 1 our results for the intermediate and maximum matching models. 



Corollary 1. IComI, MComI, IConsI and MConsI are APX-hard even when genomes $G_1$ and
$G_2$ are such that $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$.

Proof. The intermediate and maximum matching models are identical to the exemplar model when 
one of the two genomes contains no duplicates. Hence, the APX-hardness result for EComI (resp. 
EConsI) also holds for IComI and MComI (resp. IConsI and MConsI). □

3 EBD is APX-hard



Consider two genomes $G_1$ and $G_2$ with duplicates, and let EBD (resp. IBD, MBD) be the problem
which consists in finding an exemplarization (resp. intermediate matching, maximum matching)
$(G'_1, G'_2, \mathcal{M})$ of $(G_1, G_2)$ that minimizes the number of breakpoints between $G'_1$ and $G'_2$.

EBD has been proved to be NP-complete even if $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$ [7]. Some
inapproximability results also exist: in particular, it has been proved in [13] that, in the general
case, EBD cannot be approximated within a factor $c \log n$, where $c > 0$ is a constant, and cannot
be approximated within a factor 1.36 when $\mathrm{occ}(G_1) = \mathrm{occ}(G_2) = 2$. Moreover, for two balanced
genomes $G_1$ and $G_2$ such that $k = \mathrm{occ}(G_1) = \mathrm{occ}(G_2)$, several approximation algorithms for MBD
are given. These approximation algorithms achieve a ratio of 1.1037 when $k = 2$ [17],
4 when $k = 3$ [17] and $4k$ in the general case [19]. We can conclude from the above results that
the IBD and MBD problems are also NP-complete, since when one genome contains no duplicates, the
exemplar, intermediate and maximum matching models are equivalent.

In this section, we improve the above results by showing that the three problems EBD, IBD and
MBD are APX-hard, even when genomes $G_1$ and $G_2$ are such that $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$.
The main result is Theorem 2 below, which will be completed by Corollary 2 at the end of the
section.

Theorem 2. EBD is APX-hard even when genomes $G_1$ and $G_2$ are such that $\mathrm{occ}(G_1) = 1$ and
$\mathrm{occ}(G_2) = 2$.

To prove Theorem 2, we use an L-reduction from Min-Vertex-Cover-3 to EBD. Let $G = (V, E)$
be a cubic graph with $V = \{v_1 \ldots v_n\}$ and $E = \{e_1 \ldots e_m\}$. For each $i$, $1 \le i \le n$, let $e_{f_i}$, $e_{g_i}$
and $e_{h_i}$ be the three edges which are incident to $v_i$ in $G$ with $f_i < g_i < h_i$. Let $R'$ be the polynomial
transformation which associates to $G$ the following genomes $G_1$ and $G_2$, where each gene has a
positive sign:

$G_1 = a_0 \; a_1 \; b_1 \; a_2 \; b_2 \ldots a_n \; b_n \; c_1 \; d_1 \; c_2 \; d_2 \ldots c_m \; d_m \; c_{m+1}$

$G_2 = a_0 \; a_n \; d_{f_n} \; d_{g_n} \; d_{h_n} \; b_n \ldots a_2 \; d_{f_2} \; d_{g_2} \; d_{h_2} \; b_2 \; a_1 \; d_{f_1} \; d_{g_1} \; d_{h_1} \; b_1 \; c_1 \; c_2 \ldots c_m \; c_{m+1}$

with:

- $a_0 = 0$, and for each $i$, $1 \le i \le n$, $a_i = i$ and $b_i = n + i$

- $c_{m+1} = 2n + m + 1$, and for each $i$, $1 \le i \le m$, $c_i = 2n + i$ and $d_i = 2n + m + 1 + i$

We remark that there is no duplication in $G_1$, so $\mathrm{occ}(G_1) = 1$. In $G_2$, only the genes $d_i$,
$1 \le i \le m$, are duplicated and occur twice. Thus $\mathrm{occ}(G_2) = 2$.
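The transformation $R'$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the paper: the function name `build_genomes`, the edge-list input format and the choice of $K_4$ as a test graph are hypothetical; genes are encoded by the integers defined above.

```python
from collections import Counter

def build_genomes(n, edges):
    """Sketch of R': vertices are 1..n, edges is a list of pairs (e_1, ..., e_m)."""
    m = len(edges)
    a = lambda i: i                    # a_0 = 0, a_i = i
    b = lambda i: n + i                # b_i = n + i
    c = lambda j: 2 * n + j            # c_j = 2n + j (also gives c_{m+1})
    d = lambda j: 2 * n + m + 1 + j    # d_j = 2n + m + 1 + j
    # Indices of the edges incident to each vertex, in increasing order.
    incident = {v: [j for j, e in enumerate(edges, 1) if v in e]
                for v in range(1, n + 1)}
    g1 = [a(0)]
    for i in range(1, n + 1):
        g1 += [a(i), b(i)]
    for j in range(1, m + 1):
        g1 += [c(j), d(j)]
    g1.append(c(m + 1))
    g2 = [a(0)]
    for i in range(n, 0, -1):          # blocks appear in decreasing vertex order
        g2 += [a(i)] + [d(j) for j in incident[i]] + [b(i)]
    g2 += [c(j) for j in range(1, m + 2)]
    return g1, g2

# K_4 is the smallest cubic graph (n = 4, m = 6).
K4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
g1, g2 = build_genomes(4, K4)
print(max(Counter(g1).values()), max(Counter(g2).values()))
```

On this instance every gene occurs once in `g1`, and only the genes $d_j$ occur twice in `g2`, matching the remark above.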

Let $G$ be a cubic graph and $VC$ be a vertex cover of $G$. Let $G_1$ and $G_2$ be the genomes obtained
by $R'(G)$. We define $F'$ to be the polynomial transformation which associates to $VC$, $G_1$ and $G_2$
the exemplarization $F'(VC) = (G_1, G_2^E)$ of $(G_1, G_2)$ as follows. For each $i$ such that $v_i \notin VC$, we
remove from $G_2$ the genes $d_{f_i}$, $d_{g_i}$ and $d_{h_i}$. Then, for each $j$, $1 \le j \le m$, such that $d_j$ still has two
occurrences in $G_2$, we arbitrarily remove one of these occurrences in order to obtain the genome
$G_2^E$. Hence, $(G_1, G_2^E)$ is an exemplarization of $(G_1, G_2)$.

Given a cubic graph $G$, we construct $G_1$ and $G_2$ by the transformation $R'(G)$. Given an
exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$, let $S'$ be the polynomial transformation which associates to
$(G_1, G_2^E)$ the set $VC = \{v_i \mid 1 \le i \le n, \; a_i \text{ and } b_i \text{ are not consecutive in } G_2^E\}$. We claim that $VC$ is a
vertex cover of $G$. Indeed, let $e_p$, $1 \le p \le m$, be an edge of $G$. Genome $G_2^E$ contains one occurrence
of gene $d_p$ since $G_2^E$ is an exemplarization of $G_2$. By construction, there exists $i$, $1 \le i \le n$, such
that $d_p$ is in $G_2^E[a_i, b_i]$ and such that $e_p$ is incident to $v_i$. The presence of $d_p$ in $G_2^E[a_i, b_i]$ implies
that vertex $v_i$ belongs to $VC$. We can conclude that each edge of $G$ is incident to at least one vertex
of $VC$.

Lemmas 5 and 6 below are used to prove that $(R', S')$ is an L-reduction from the Min-Vertex-Cover-3
problem to the EBD problem. Let $G = (V, E)$ be a cubic graph with $V = \{v_1, v_2 \ldots v_n\}$
and $E = \{e_1, e_2 \ldots e_m\}$ and let us construct $(G_1, G_2)$ by the transformation $R'(G)$.

Lemma 5. Let $VC$ be a vertex cover of $G$ and $(G_1, G_2^E)$ the exemplarization given by $F'(VC)$.
Then $|VC| = k \Rightarrow B(G_1, G_2^E) \le n + 2m + k + 1$, where $B(G_1, G_2^E)$ is the number of breakpoints
between $G_1$ and $G_2^E$.

Proof. Suppose $|VC| = k$. Let us list the breakpoints between genomes $G_1$ and $G_2^E$ obtained by
$F'(VC)$. The pairs $(b_i, a_{i+1})$, $1 \le i \le n - 1$, and $(b_n, c_1)$ induce one breakpoint each. For all
$i$, $1 \le i \le m$, each pair of the form $(c_i, d_i)$ (resp. $(d_i, c_{i+1})$) induces one breakpoint. For all $i$,
$1 \le i \le n$, such that $v_i \in VC$, $(a_i, b_i)$ induces at most one breakpoint. Finally, the pair $(a_0, a_1)$
induces one breakpoint. Thus there are at most $n + 2m + k + 1$ breakpoints between $G_1$ and $G_2^E$. □

Lemma 6. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and $VC'$ be the vertex cover of $G$
obtained by $S'(G_1, G_2^E)$. We have $B(G_1, G_2^E) = k' \Rightarrow |VC'| = k' - n - 2m - 1$.

Proof. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and $VC'$ be the vertex cover obtained by
$S'(G_1, G_2^E)$. Suppose $B(G_1, G_2^E) = k'$. For any exemplarization $(G_1, G_2^E)$ of $(G_1, G_2)$, the following
breakpoints always occur: the pair $(a_0, a_1)$; for each $i$, $1 \le i \le m$, each pair $(c_i, d_i)$ and $(d_i, c_{i+1})$;
for each $i$, $1 \le i \le n - 1$, the pair $(b_i, a_{i+1})$; the pair $(b_n, c_1)$. Thus, we have at least $n + 2m + 1$
breakpoints. The other possible breakpoints are induced by pairs of the form $(a_i, b_i)$. Since we
have $B(G_1, G_2^E) = k'$, there are exactly $k' - n - 2m - 1$ such breakpoints. By construction of $VC'$,
the cardinality of $VC'$ is equal to the number of breakpoints induced by pairs of the form $(a_i, b_i)$.
So, we have: $|VC'| = k' - n - 2m - 1$. □
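Lemmas 5 and 6 can be checked numerically on a small cubic graph. The sketch below is an illustration only: the helper names (`genome1`, `exemplarize`, `breakpoints`, `cover_from`) are hypothetical, the "arbitrary" removal in $F'$ is fixed to keeping the leftmost surviving copy of each $d_j$, and breakpoints are counted in the simplified setting where all genes are positive (a consecutive pair of $G_1$ is a breakpoint when it is not consecutive, in the same order, in $G_2^E$).

```python
def genome1(n, m):
    """G_1 = a_0 a_1 b_1 ... a_n b_n c_1 d_1 ... c_m d_m c_{m+1} as integers."""
    g1 = [0]
    for i in range(1, n + 1):
        g1 += [i, n + i]
    for j in range(1, m + 1):
        g1 += [2 * n + j, 2 * n + m + 1 + j]
    g1.append(2 * n + m + 1)
    return g1

def exemplarize(n, edges, cover):
    """F'(VC): build G_2^E, keeping the leftmost surviving copy of each d_j."""
    m = len(edges)
    kept = set()
    g2e = [0]                                        # a_0
    for i in range(n, 0, -1):
        g2e.append(i)                                # a_i
        if i in cover:                               # blocks of uncovered vertices lose their d genes
            for j, e in enumerate(edges, 1):
                if i in e and j not in kept:
                    kept.add(j)
                    g2e.append(2 * n + m + 1 + j)    # d_j
        g2e.append(n + i)                            # b_i
    g2e += [2 * n + j for j in range(1, m + 2)]      # c_1 ... c_{m+1}
    return g2e

def breakpoints(g1, g2):
    """Breakpoints between duplicate-free genomes whose genes are all positive."""
    adjacent = set(zip(g2, g2[1:]))
    return sum(pair not in adjacent for pair in zip(g1, g1[1:]))

def cover_from(n, g2e):
    """S': vertices v_i whose genes a_i and b_i are not consecutive in G_2^E."""
    return {i for i in range(1, n + 1) if g2e.index(n + i) - g2e.index(i) != 1}

K4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
n, m = 4, len(K4)
for cover in ({1, 2, 3}, {1, 2, 3, 4}):
    g2e = exemplarize(n, K4, cover)
    B = breakpoints(genome1(n, m), g2e)
    print(sorted(cover), B, sorted(cover_from(n, g2e)))
```

In both runs the breakpoint count equals $n + 2m + 1 + |VC'|$ (Lemma 6) and does not exceed $n + 2m + k + 1$ (Lemma 5).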

To prove that $(R', S')$ is an L-reduction, we first notice that properties (a) and (b) of an
L-reduction are trivially verified. The next lemma proves property (c).

Lemma 7. The inequality $\mathrm{opt}_{\text{EBD}}(G_1, G_2) \le 12 \cdot \mathrm{opt}_{\text{Min-VC}}(G)$ holds.

Proof. For a cubic graph $G$ with $n$ vertices and $m$ edges, we have $2m = 3n$ (see (4)) and
$\mathrm{opt}_{\text{Min-VC}}(G) \ge \frac{n}{2}$ (see (5)). By construction of the genomes $G_1$ and $G_2$, any exemplarization of
$(G_1, G_2)$ contains $2n + 2m + 2$ genes in each genome. Thus, we have $\mathrm{opt}_{\text{EBD}}(G_1, G_2) \le 2n + 2m + 2 \le 6n$
($n \ge 4$ in a cubic graph). Hence, we conclude that $\mathrm{opt}_{\text{EBD}}(G_1, G_2) \le 12 \cdot \mathrm{opt}_{\text{Min-VC}}(G)$. □
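The chain of inequalities used in this proof can be written out step by step. The following is a worked restatement using only the facts quoted above ($2m = 3n$, $n \ge 4$ and $\mathrm{opt}_{\text{Min-VC}}(G) \ge n/2$):

```latex
\begin{align*}
\mathrm{opt}_{\mathrm{EBD}}(G_1, G_2)
  &\le 2n + 2m + 2 = 5n + 2
     && \text{since } 2m = 3n \\
  &\le 6n
     && \text{since } n \ge 4 \ge 2 \\
  &= 12 \cdot \tfrac{n}{2}
   \le 12 \cdot \mathrm{opt}_{\text{Min-VC}}(G)
     && \text{since } \mathrm{opt}_{\text{Min-VC}}(G) \ge \tfrac{n}{2}
\end{align*}
```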

Now, we prove property (d) of our L-reduction. 

Lemma 8. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and let $VC'$ be the vertex cover of $G$
obtained by $S'(G_1, G_2^E)$. Then, we have $|\mathrm{opt}_{\text{Min-VC}}(G) - |VC'|| \le |\mathrm{opt}_{\text{EBD}}(G_1, G_2) - B(G_1, G_2^E)|$.

Proof. Let $(G_1, G_2^E)$ be an exemplarization of $(G_1, G_2)$ and $VC'$ be the vertex cover of $G$ obtained
by $S'(G_1, G_2^E)$. Let $VC$ be a vertex cover of $G$ such that $|VC| = \mathrm{opt}_{\text{Min-VC}}(G)$.

We know that $\mathrm{opt}_{\text{Min-VC}}(G) \le |VC'|$ and $\mathrm{opt}_{\text{EBD}}(G_1, G_2) \le B(G_1, G_2^E)$. So, it is sufficient to
prove $|VC'| - \mathrm{opt}_{\text{Min-VC}}(G) \le B(G_1, G_2^E) - \mathrm{opt}_{\text{EBD}}(G_1, G_2)$.

By Lemma 5, we have $B(F'(VC)) \le n + 2m + 1 + \mathrm{opt}_{\text{Min-VC}}(G)$, which implies
$\mathrm{opt}_{\text{EBD}}(G_1, G_2) \le B(F'(VC)) \le n + 2m + 1 + \mathrm{opt}_{\text{Min-VC}}(G)$. Then

$B(G_1, G_2^E) - \mathrm{opt}_{\text{EBD}}(G_1, G_2) \ge B(G_1, G_2^E) - n - 2m - 1 - \mathrm{opt}_{\text{Min-VC}}(G)$   (8)

By Lemma 6, we have $|VC'| = B(G_1, G_2^E) - n - 2m - 1$, which implies

$|VC'| - \mathrm{opt}_{\text{Min-VC}}(G) = B(G_1, G_2^E) - n - 2m - 1 - \mathrm{opt}_{\text{Min-VC}}(G)$   (9)

Finally, by (8) and (9), we get $|VC'| - \mathrm{opt}_{\text{Min-VC}}(G) \le B(G_1, G_2^E) - \mathrm{opt}_{\text{EBD}}(G_1, G_2)$. □

Lemmas 7 and 8 prove that the pair $(R', S')$ is an L-reduction from Min-Vertex-Cover-3 to
EBD. Hence, EBD is APX-hard even if $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$, and Theorem 2 is proved.
We extend in Corollary 2 our results to the intermediate and maximum matching models.

Corollary 2. The IBD and MBD problems are APX-hard even when genomes $G_1$ and $G_2$ are
such that $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$.

Proof. The intermediate and maximum matching models are identical to the exemplar model when 
one of the two genomes contains no duplicates. Hence, the APX-hardness result for EBD also 
holds for IBD and MBD. □ 



4 Zero breakpoint distance 

This section is devoted to zero breakpoint distance recognition issues. In [13], the authors
showed that deciding whether the exemplar breakpoint distance between any two genomes is zero or
not is NP-complete even when no gene occurs more than three times in both genomes, i.e., instances
of type (3,3). This important result implies that the exemplar breakpoint distance problem does
not admit any polynomial-time approximation, unless P = NP. Following this line of research,
we first complement the result of [13] by proving that deciding whether the exemplar breakpoint
distance between any two genomes is zero or not is NP-complete, even when no gene occurs
more than twice in one of the genomes (the maximum number of occurrences is however unbounded
in the other genome). This result is next extended to the intermediate matching model and we give
a practical - but exponential - algorithm for deciding whether the exemplar breakpoint distance
between any two genomes is zero or not in case no gene occurs more than twice in both genomes (a
problem whose complexity, P versus NP-complete, remains open). Finally, we show that deciding
whether the maximum matching breakpoint distance between any two genomes is zero or not is
polynomial-time solvable and hence that such negative approximation results (the ones we obtained
for the exemplar and intermediate models) do not propagate to the maximum matching model.
The following easy observation will prove extremely useful in the sequel of the present section.

Observation 3 Let $G_1$ and $G_2$ be two genomes. If the exemplar breakpoint distance between $G_1$
and $G_2$ is zero, then there exists an exemplarization $(G_1^E, G_2^E)$ of $(G_1, G_2)$ such that (1) $G_1^E = G_2^E$,
or (2) $-(G_1^E)^r = G_2^E$, where $-(G_1^E)^r$ is the signed reversal of genome $G_1^E$. The same observation
can be made for the intermediate and maximum matching models.

4.1 Zero exemplar breakpoint distance

The zero exemplar breakpoint distance (ZEBD) problem is formally defined as follows.

Problem: ZEBD

Input: Two genomes $G_1$ and $G_2$.

Question: Is the exemplar breakpoint distance between $G_1$ and $G_2$ equal to zero?

Aiming at precisely defining the inapproximability landscape of computing the exemplar breakpoint
distance between two genomes, we complement the result of [13], who showed ZEBD to be
NP-complete even for instances of type (3,3), by the following theorem.

Theorem 4. ZEBD is NP-complete even if no gene occurs more than twice in $G_1$.

Proof. Membership of ZEBD in NP is immediate. The reduction we use to prove hardness is from
Min-Vertex-Cover [16]. Let an arbitrary instance of Min-Vertex-Cover be given by a graph
$G = (V, E)$ and a positive integer $k$. Write $V = \{v_1, v_2 \ldots v_n\}$ and $E = \{e_1, e_2 \ldots e_m\}$. In the rest of
the proof, elements of $V$ (resp. $E$) will be seen either as vertices (resp. edges) or genes, depending
on the context. The corresponding instance $(G_1, G_2)$ of ZEBD is defined as follows:

$G_1 = v_1 \; X_1 \; v_2 \; X_2 \ldots v_n \; X_n$
$G_2 = Y[1] \; Y[2] \ldots Y[k] \; Y_V$.

For each $i = 1, 2, \ldots, n$, $X_i$ is defined to be $X_i = e_{i_1} e_{i_2} \ldots e_{i_j}$, where $e_{i_1}, e_{i_2}, \ldots, e_{i_j}$, $i_1 < i_2 <
\ldots < i_j$, are the edges incident to vertex $v_i$. The strings $Y[i]$, $1 \le i \le k$, are all equal and are
defined by $Y[i] = Y_V \; Y_E$ where $Y_V = v_1 \; v_2 \ldots v_n$ and $Y_E = e_1 \; e_2 \ldots e_m$.

Notice that no gene occurs more than twice in $G_1$ (actually genes $v_i$ occur once and genes $e_i$
occur twice). However, the number of occurrences of each gene in $G_2$ is upper bounded by $k + 1$.
Furthermore, all genes have positive sign, and hence according to Observation 3 we only need to
consider exemplarizations $(G_1^E, G_2^E)$ of $(G_1, G_2)$ such that $G_1^E = G_2^E$.

It is immediate to check that our construction can be carried out in polynomial time. We now
claim that there exists a vertex cover of size $k$ in $G$ iff the exemplar breakpoint distance between
$G_1$ and $G_2$ is zero.

Suppose first that there exists a vertex cover $V' \subseteq V$ of size $k$ in $G$. Write $V' = \{v_{i_1}, v_{i_2}, \ldots, v_{i_k}\}$,
$i_1 < i_2 < \ldots < i_k$. For convenience, we also define $i_0$ to be 0. From $V'$ we construct an
exemplarization $(G_1^E, G_2^E)$ as follows. We obtain $G_1^E$ from $G_1$ by a two-step procedure. First we delete in
$G_1$ all strings $X_i$ such that $v_i \notin V'$. Second, for each $1 \le j \le m$, if gene $e_j$ still occurs twice, we
delete its second occurrence (this second step is concerned with edges connecting two vertices in
$V'$). We now turn to $G_2^E$. For $1 \le j \le k$, we consider the string $Y[j] = Y_V \; Y_E$ that we process
as follows: (1) we delete in $Y_V$ all genes but $v_{i_j}$ and those genes $v_\ell \notin V'$ such that $i_{j-1} < \ell < i_j$,
and (2) we delete in $Y_E$ every gene $e_i$ that is not incident to $v_{i_j}$, or that is incident to both $v_{i_j}$ and
some smaller vertex in $V'$ (i.e., $e_i = \{v_{i_{j'}}, v_{i_j}\}$ for some $j' < j$). Finally, we delete in the trailing
string $Y_V = v_1 \; v_2 \ldots v_n$ all genes but those $v_i \notin V'$ such that $i_k < i$. Since $V'$ is a vertex cover
in $G$, it follows that each gene occurs once in the obtained genomes, i.e., $(G_1^E, G_2^E)$ is indeed
an exemplarization of $(G_1, G_2)$. It is now easily seen that $G_1^E = G_2^E$, and hence that the exemplar
breakpoint distance between $G_1$ and $G_2$ is zero.

Conversely, suppose that the exemplar breakpoint distance between $G_1$ and $G_2$ is zero. Since all
genes have a positive sign, it follows that there exists an exemplarization $(G_1^E, G_2^E)$ of $(G_1, G_2)$
such that $G_1^E = G_2^E$. Exemplarization $G_2^E$ can be written as

$G_2^E = Y_V[1] \; Y_E[1] \; Y_V[2] \; Y_E[2] \ldots Y_V[k] \; Y_E[k] \; Y_V[k+1]$

where $Y_V[i]$, $1 \le i \le k + 1$, is a string on $V$ and $Y_E[i]$, $1 \le i \le k$, is a string on $E$, $V$ and $E$
being viewed as alphabets. Now, define $V' \subseteq V$ as follows: $v_i \in V'$ iff gene $v_i$ occurs in some $Y_V[j]$,
$1 \le j \le k$, as the last gene. By construction, $|V'| \le k$ (we may indeed have $|V'| < k$ if some $Y_V[j]$,
$1 \le j \le k$, denotes the empty string). We now observe that, since no gene $v_i$ is duplicated in $G_1$,
all genes $e_\ell$ that occur between some gene $v_i \in V'$ and some gene $v_j \in V$ in $G_2^E$ should match genes
in string $X_i$ in $G_1$. Then it follows that $V'$ is a vertex cover of size at most $k$ in $G$. □
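The reduction can be tried out on very small graphs. The sketch below is an illustration under assumptions, not the authors' procedure: the names `zebd_instance` and `zero_exemplar_distance` are hypothetical, genes are encoded as strings, and the zero-distance test is a brute force that is exponential in the number of duplicated genes of $G_1$. It relies on the fact that, when all genes are positive and both genomes have the same gene content, an exemplarization with $G_1^E = G_2^E$ exists exactly when some exemplarization of $G_1$ is a subsequence of $G_2$.

```python
from itertools import product

def zebd_instance(n, edges, k):
    """Genomes (G_1, G_2) built from a graph on vertices 1..n and an integer k."""
    m = len(edges)
    g1 = []
    for i in range(1, n + 1):
        g1.append(f"v{i}")
        # X_i: edges incident to v_i, in increasing index order
        g1 += [f"e{j}" for j, e in enumerate(edges, 1) if i in e]
    y_v = [f"v{i}" for i in range(1, n + 1)]
    y_e = [f"e{j}" for j in range(1, m + 1)]
    g2 = (y_v + y_e) * k + y_v            # Y[1] ... Y[k] followed by trailing Y_V
    return g1, g2

def is_subsequence(s, t):
    it = iter(t)
    return all(x in it for x in s)        # each lookup consumes the iterator

def zero_exemplar_distance(g1, g2):
    """Brute force: try every choice of one kept occurrence per gene of g1."""
    positions = {}
    for p, g in enumerate(g1):
        positions.setdefault(g, []).append(p)
    for choice in product(*positions.values()):
        kept = [g1[p] for p in sorted(choice)]
        if is_subsequence(kept, g2):
            return True
    return False

triangle = [(1, 2), (1, 3), (2, 3)]       # minimum vertex cover has size 2
for k in (1, 2):
    print(k, zero_exemplar_distance(*zebd_instance(3, triangle, k)))
```

For the triangle, the distance is zero for $k = 2$ but not for $k = 1$, in agreement with the size of its minimum vertex cover.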

The complexity of ZEBD remains open in case no gene occurs more than twice in $G_1$ and
more than a constant number of times in $G_2$, i.e., instances of type $(2, c)$ for some $c = O(1)$; recall here that
ZEBD is NP-complete if no gene occurs more than three times in $G_1$ or in $G_2$ (instances of type
(3,3), [13]). In particular, the complexity of ZEBD for instances of type (2,2) is open. However,
we propose here a practical - but exponential - algorithm for ZEBD for instances of type (2,2),
which is well-suited in case the number of genes that occur twice both in $G_1$ and in $G_2$ is relatively
small.

Proposition 1. ZEBD for instances of type (2,2) (no gene occurs more than twice in $G_1$ and in
$G_2$) is solvable in $O^*(1.6182^{2k})$ time, where $k$ is upper-bounded by the number of genes that occur
exactly twice in $G_1$ and in $G_2$.

Proof. According to Observation 3, for any instance $(G_1, G_2)$, we only need to focus on
exemplarizations $(G_1^E, G_2^E)$ such that $G_1^E = G_2^E$ or $-(G_1^E)^r = G_2^E$, where $-(G_1^E)^r$ is the signed reversal of
$G_1^E$. Let us first consider the case $G_1^E = G_2^E$ (the case $-(G_1^E)^r = G_2^E$ is identical up to a signed
reversal and will thereby be briefly discussed at the end of the proof).

Let $(G_1, G_2)$ be an instance of type (2,2) of ZEBD. Our algorithm transforms the instance
$(G_1, G_2)$ into a CNF boolean formula $\varphi$ with only few large clauses such that $\varphi$ is satisfiable iff the
exemplar breakpoint distance between $G_1$ and $G_2$ is zero. By hypothesis, each signed gene occurs
at most twice in $G_1$ and in $G_2$. Therefore, for any signed gene $g$, we have one out of four possible
distinct configurations depicted in Figure 2, where $p_1$, $p_2$, $q_1$ and $q_2$ are positions of occurrence of $g$
in $G_1$ and $G_2$. Furthermore, since we are looking for an exemplarization $(G_1^E, G_2^E)$ of $(G_1, G_2)$ such
that $G_1^E = G_2^E$, we may assume, in case $g$ occurs only once in $G_1$ or in $G_2$, that all occurrences of $g$
have the same sign (otherwise a trivial self-reduction would indeed apply). In other words, referring
to Figure 2, we assume $G_1[p_1] = G_2[q_1] = G_2[q_2]$ in case (2), $G_1[p_1] = G_1[p_2] = G_2[q_1]$ in case (3),
and $G_1[p_1] = G_2[q_1]$ in case (4). Finally, as for case (1), we may assume that either all occurrences
have the same sign, or $G_1[p_1] = -G_1[p_2]$ and $G_2[q_1] = -G_2[q_2]$ (otherwise a trivial self-reduction
would again apply).

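We now describe the construction of the CNF boolean formula $\varphi$. First, the set of boolean
variables $X$ is defined as follows: for each gene $g$ occurring at position $p$ in $G_1$ and at position $q$
in $G_2$ (i.e., $|G_1[p]| = |G_2[q]|$) we add to $X$ the boolean variable $x_q^p$. We now turn to defining the
clauses of $\varphi$. Let $g$ be any gene, and let the occurrence positions of $g$ in $G_1$ and in $G_2$ be noted as
in Figure 2.

- if $\mathrm{occ}(g, G_1) = \mathrm{occ}(g, G_2) = 2$ (case (1)),

  - if $G_1[p_1] = G_1[p_2] = G_2[q_1] = G_2[q_2]$, we add to $\varphi$ the clause $(x_{q_1}^{p_1} \vee x_{q_2}^{p_1} \vee x_{q_1}^{p_2} \vee x_{q_2}^{p_2})$ and the six clauses
    $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_2}^{p_1}})$, $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_1}^{p_2}})$, $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_2}^{p_2}})$, $(\overline{x_{q_2}^{p_1}} \vee \overline{x_{q_1}^{p_2}})$, $(\overline{x_{q_2}^{p_1}} \vee \overline{x_{q_2}^{p_2}})$ and $(\overline{x_{q_1}^{p_2}} \vee \overline{x_{q_2}^{p_2}})$,

  - otherwise, we have $G_1[p_1] = -G_1[p_2]$ and $G_2[q_1] = -G_2[q_2]$ (see above discussion),

    - if $G_1[p_1] = G_2[q_1]$ and $G_1[p_2] = G_2[q_2]$, we add to $\varphi$ the clauses $(x_{q_1}^{p_1} \vee x_{q_2}^{p_2})$ and
      $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_2}^{p_2}})$,

    - if $G_1[p_1] = G_2[q_2]$ and $G_1[p_2] = G_2[q_1]$, we add to $\varphi$ the clauses $(x_{q_2}^{p_1} \vee x_{q_1}^{p_2})$ and
      $(\overline{x_{q_2}^{p_1}} \vee \overline{x_{q_1}^{p_2}})$,

- if $\mathrm{occ}(g, G_1) = 1$ and $\mathrm{occ}(g, G_2) = 2$ (case (2)), we add to $\varphi$ the clauses $(x_{q_1}^{p_1} \vee x_{q_2}^{p_1})$ and $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_2}^{p_1}})$,

- if $\mathrm{occ}(g, G_1) = 2$ and $\mathrm{occ}(g, G_2) = 1$ (case (3)), we add to $\varphi$ the clauses $(x_{q_1}^{p_1} \vee x_{q_1}^{p_2})$ and $(\overline{x_{q_1}^{p_1}} \vee \overline{x_{q_1}^{p_2}})$,
  and

- if $\mathrm{occ}(g, G_1) = \mathrm{occ}(g, G_2) = 1$ (case (4)), we add to $\varphi$ the clause $(x_{q_1}^{p_1})$.

Fig. 2. The 4 gene-configurations for instances of type (2,2): $p_1$ and $p_2$ are the occurrence positions of gene $g$ in $G_1$,
and $q_1$ and $q_2$ are the occurrence positions of gene $g$ in $G_2$.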

The rationale of this construction is that if formula $\varphi$ evaluates to true for some assignment $f$
and $f(x_q^p)$ is true for some gene $g$ occurring at position $p$ in $G_1$ and $q$ in $G_2$, then all occurrences of
$g$ but the one at position $p$ should be deleted in $G_1$ and all occurrences of $g$ but the one at position
$q$ should be deleted in $G_2$, in order to obtain the exemplar solution. What is left is to enforce that
$\varphi$ evaluates to true iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero. To this aim,
we add to $\varphi$ the following clauses. For each pair of variables $(x_{j_1}^{i_1}, x_{j_2}^{i_2})$ such that $|G_1[i_1]| \ne |G_1[i_2]|$,
$i_1 < i_2$ and $j_1 > j_2$, we add to $\varphi$ the clause $(\overline{x_{j_1}^{i_1}} \vee \overline{x_{j_2}^{i_2}})$. The construction of $\varphi$ is now complete.

Clearly, $\varphi$ evaluates to true iff the exemplar breakpoint distance between $G_1$ and $G_2$ is zero.
Let $k$ be the number of genes $g$ that occur twice in $G_1$ and in $G_2$ with the same sign, i.e., $G_1[p_1] =
G_1[p_2] = G_2[q_1] = G_2[q_2]$. We now make the important observation that all clauses in $\varphi$ have size
less than or equal to 2 except those $k$ clauses of size 4 introduced in case gene $g$ occurs twice in
$G_1$ and in $G_2$ with the same sign. By introducing a new boolean variable, we can easily replace in
$\varphi$ each clause of size 4 by two clauses of size 3, and hence we may now assume that $\varphi$ is a 3-CNF
formula (i.e., each clause has size at most 3) with exactly $2k$ clauses of size 3.
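The construction and the clause-splitting step can be sketched for the special case where all genes are positive and both genomes have the same gene content. Under that assumption, cases (1) to (4) all amount to "exactly one variable $x_q^p$ is true per gene". The code below is an illustration only: the names `build_cnf`, `split_size4` and `satisfiable` are hypothetical, and the exhaustive satisfiability test merely stands in for a real 3-SAT solver.

```python
from itertools import product

def build_cnf(g1, g2):
    """Clauses are lists of (variable index, polarity); all genes assumed positive."""
    var = {}
    for p, a in enumerate(g1):
        for q, b in enumerate(g2):
            if a == b:
                var[(p, q)] = len(var)            # variable x_q^p
    by_gene = {}
    for (p, q), x in var.items():
        by_gene.setdefault(g1[p], []).append(x)
    clauses = []
    for xs in by_gene.values():
        clauses.append([(x, True) for x in xs])   # at least one kept pair per gene
        clauses += [[(x, False), (y, False)]      # at most one kept pair per gene
                    for k, x in enumerate(xs) for y in xs[k + 1:]]
    for (p1, q1), x in var.items():               # kept pairs must not cross
        for (p2, q2), y in var.items():
            if g1[p1] != g1[p2] and p1 < p2 and q1 > q2:
                clauses.append([(x, False), (y, False)])
    return len(var), clauses

def split_size4(num_vars, clauses):
    """Replace (a v b v c v d) by (a v b v z) and (not z v c v d), z a fresh variable."""
    out = []
    for cl in clauses:
        if len(cl) == 4:
            z = num_vars
            num_vars += 1
            out.append(cl[:2] + [(z, True)])
            out.append([(z, False)] + cl[2:])
        else:
            out.append(cl)
    return num_vars, out

def satisfiable(num_vars, clauses):
    """Exhaustive check, only meant for tiny formulas."""
    return any(
        all(any(bits[x] == pol for x, pol in cl) for cl in clauses)
        for bits in product([False, True], repeat=num_vars)
    )

n, cnf = build_cnf(list("abab"), list("baab"))    # both genes occur twice: k = 2
n3, cnf3 = split_size4(n, cnf)
print(sum(len(c) == 3 for c in cnf3), satisfiable(n3, cnf3))
```

On this instance the two clauses of size 4 become $2k = 4$ clauses of size 3, and satisfiability is unchanged by the splitting.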

As for the case $-(G_1^E)^r = G_2^E$, we replace $G_1$ by $-(G_1)^r$ and construct another 3-CNF formula
$\varphi'$ as described above. The two 3-CNF formulas need, however, to be examined separately.

Fernau proposed in [15] an algorithm for solving 3-CNF boolean formulas that runs in $O^*(1.6182^t)$
time, where $t$ is the number of clauses of size 3. Therefore, ZEBD for instances of type (2,2) is
solvable in $O^*(1.6182^{2k})$ time, where $k$ is the number of genes $g$ that occur twice in $G_1$ and in
$G_2$. □

4.2 Zero intermediate matching breakpoint distance

We now turn to the zero intermediate breakpoint distance (ZIBD) problem. It is defined as follows. 

Problem: ZIBD 

Input: Two genomes $G_1$ and $G_2$.

Question: Is the intermediate breakpoint distance between $G_1$ and $G_2$ equal to zero?

We show here that ZEBD and ZIBD are equivalent problems. We need the following lemma. 

Lemma 9 ([2]). Let $G_1$ and $G_2$ be two genomes without duplicates and with the same gene
content, and $G'_1$ and $G'_2$ be the two genomes obtained from $G_1$ and $G_2$ by deleting any gene $g$. Then
$B(G'_1, G'_2) \le B(G_1, G_2)$.

Theorem 5. ZEBD and ZIBD are equivalent problems. 

Proof. One direction is trivial (any exemplarization is indeed an intermediate matching). The other 
direction follows from Lemma 9. □ 

It follows from Theorem 5 that the problem IBD is not approximable even for instances of type
(3,3) (see [13]) and if no gene occurs more than twice in $G_1$ (see Theorem 4).

4.3 Zero maximum matching breakpoint distance 

We show here that, in contrast to the exemplar and the intermediate matching models, deciding
whether the maximum matching breakpoint distance between two genomes is equal to zero is
polynomial-time solvable, and hence we cannot rule out the existence of accurate approximation
algorithms for the maximum matching model. We refer to this problem as ZMBD.

Problem: ZMBD 

Input: Two genomes $G_1$ and $G_2$.

Question: Is the maximum matching breakpoint distance between $G_1$ and $G_2$ equal to zero?

The main idea of our approach is to transform any instance of ZMBD into a matching diagram 
and next use an efficient algorithm for finding a large set of non-intersecting line segments. Note 
that this latter problem is equivalent to finding a large increasing subsequence in permutations. 

A matching diagram [18] consists of, say n, points on each of two parallel lines, and n straight 
line segments matching distinct pairs of points. The intersection graph of the line segments is called 
a permutation graph (the reason for the name is that if the points on the top line are numbered 
1,2, ... ,n, then the points on the other line are numbered by a permutation on 1,2, ... , n). 

We describe how to turn the pair of genomes $(G_1, G_2)$ into a matching diagram $D(G_1, G_2)$.
For the sake of presentation we introduce the following notations. For each gene family $g$, we write
$\mathrm{occ}_{pos}(G, g)$ (resp. $\mathrm{occ}_{neg}(G, g)$) for the number of positive (resp. negative) occurrences of gene
$g$ in genome $G$. According to Observation 3, it is enough to consider two cases: $G_1^M = G_2^M$ or
$-(G_1^M)^r = G_2^M$, where $(G_1^M, G_2^M, \mathcal{M})$ is a maximum matching of $(G_1, G_2)$.

Let us first focus on testing $G_1^M = G_2^M$ (the case $-(G_1^M)^r = G_2^M$ is identical up to a signed
reversal). We describe the construction of the top labeled points. Reading genome $G_1$ from left to
right, we replace gene $g$ by the sequence of labeled points

$+g_1(i, \mathrm{occ}_{pos}(G_2, g)) \quad +g_1(i, \mathrm{occ}_{pos}(G_2, g) - 1) \quad \ldots \quad +g_1(i, 1)$

if $g$ is the $i$-th positive occurrence of gene $g$ in genome $G_1$ or by the sequence of labeled points

$-g_1(i, \mathrm{occ}_{neg}(G_2, g)) \quad -g_1(i, \mathrm{occ}_{neg}(G_2, g) - 1) \quad \ldots \quad -g_1(i, 1)$

if $g$ is the $i$-th negative occurrence of gene $g$ in genome $G_1$. A symmetric construction is performed
for the labeled points of the bottom line, i.e., reading genome $G_2$ from left to right, we replace gene
$g$ by the sequence of labeled points

$+g_2(i, \mathrm{occ}_{pos}(G_1, g)) \quad +g_2(i, \mathrm{occ}_{pos}(G_1, g) - 1) \quad \ldots \quad +g_2(i, 1)$

if $g$ is the $i$-th positive occurrence of gene $g$ in genome $G_2$ or by the sequence of labeled points

$-g_2(i, \mathrm{occ}_{neg}(G_1, g)) \quad -g_2(i, \mathrm{occ}_{neg}(G_1, g) - 1) \quad \ldots \quad -g_2(i, 1)$

if $g$ is the $i$-th negative occurrence of gene $g$ in genome $G_2$. We now obtain the matching diagram
$D(G_1, G_2)$ as follows: each labeled point $+g_1(i, j)$ (resp. $-g_1(i, j)$) of the top line is connected to the
labeled point $+g_2(j, i)$ (resp. $-g_2(j, i)$) of the bottom line by a line segment. Clearly, each labeled
point is incident to exactly one line segment, and hence $D(G_1, G_2)$ is indeed a matching diagram.

Of particular importance, observe that by construction, for any $x \in \{1, 2\}$ and any two labeled
points $+g_x(i, j)$ and $+g_x(i, k)$, $j \ne k$, the two line segments incident to these two points are
intersecting; the same conclusion can be drawn for any two labeled points $-g_x(i, j)$ and $-g_x(i, k)$,
$j \ne k$. The following lemma states this property in a suitable way.

Lemma 10. If $[+g_1(i, j), +g_2(j, i)]$ and $[+g_1(k, \ell), +g_2(\ell, k)]$ (resp. $[-g_1(i, j), -g_2(j, i)]$ and
$[-g_1(k, \ell), -g_2(\ell, k)]$) are two non-intersecting line segments in the matching diagram $D(G_1, G_2)$,
then $i \ne k$ and $j \ne \ell$.

Theorem 6. ZMBD is polynomial-time solvable. 

Proof. Let $G_1$ and $G_2$ be two genomes, and $m$ the size of a maximum matching between $G_1$ and $G_2$.
According to Lemma 10, there exists a maximum matching $(G_1^M, G_2^M, \mathcal{M})$ of $(G_1, G_2)$ such that
$G_1^M = G_2^M$ if and only if there exist $m$ non-intersecting line segments in $D(G_1, G_2)$. The maximum number
of non-intersecting line segments in a matching diagram with $n$ points on each line can be found
in $O(n \log \log n)$ time [8].

As for the case $-(G_1^M)^r = G_2^M$, we replace $G_1$ by $-(G_1)^r$ and run the same algorithm on the
obtained matching diagram. □
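The test $G_1^M = G_2^M$ can be sketched as follows for genomes whose genes are all positive. This is an illustration under assumptions: the function names are hypothetical, and the longest increasing subsequence is computed with a standard binary-search method in $O(N \log N)$ time for $N$ labeled points, swapped in for the $O(n \log \log n)$ algorithm of [8].

```python
from bisect import bisect_left
from collections import Counter

def matching_permutation(g1, g2):
    """Bottom-line position of the partner of each top-line point, left to right."""
    occ1, occ2 = Counter(g1), Counter(g2)
    bottom, seen, pos = {}, Counter(), 0
    for g in g2:                          # j-th occurrence of g in G_2
        seen[g] += 1
        for i in range(occ1[g], 0, -1):   # points g_2(j, occ) ... g_2(j, 1)
            bottom[(g, seen[g], i)] = pos
            pos += 1
    perm, seen = [], Counter()
    for g in g1:                          # i-th occurrence of g in G_1
        seen[g] += 1
        for j in range(occ2[g], 0, -1):   # points g_1(i, occ) ... g_1(i, 1)
            perm.append(bottom[(g, j, seen[g])])   # g_1(i, j) is joined to g_2(j, i)
    return perm

def longest_increasing(seq):
    """Length of a longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

def zero_max_matching_distance(g1, g2):
    """True iff some maximum matching gives identical genomes (positive genes only)."""
    m = sum((Counter(g1) & Counter(g2)).values())  # size of a maximum matching
    return longest_increasing(matching_permutation(g1, g2)) == m

print(zero_max_matching_distance(list("abcab"), list("abab")))
print(zero_max_matching_distance(list("aab"), list("aba")))
```

A set of non-intersecting segments corresponds to an increasing subsequence of the permutation, so the test compares the length of a longest increasing subsequence with $m$.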

5 Approximating the number of adjacencies in the maximum matching model 

For two balanced genomes $G_1$ and $G_2$, several approximation algorithms for computing the number
of breakpoints between $G_1$ and $G_2$ are given for the maximum matching model [17, 19]. We propose
in this section three approximation algorithms to maximize the number of adjacencies (as opposed
to minimizing the number of breakpoints). The approximation ratios we obtain are 1.1442 when
$\mathrm{occ}(G_1) = 2$, 3 when $\mathrm{occ}(G_1) = 3$ and 4 in the general case. Observe that in the latter case,
in contrast to [17, 19], our approximation ratio is independent of the maximum number of duplicates.
Note also that in [12], inapproximability results are given for two unbalanced genomes $G_1$ and $G_2$
even when $\mathrm{occ}(G_1) = 1$ and $\mathrm{occ}(G_2) = 2$.

We first define the problem Max-$k$-Adj we are interested in ($k \ge 1$ is a fixed integer).

Problem: Max-$k$-Adj

Input: Two balanced genomes $G_1$ and $G_2$ with $\mathrm{occ}(G_1) = k$ (and consequently $\mathrm{occ}(G_2) = k$).

Solution: A maximum matching $(G_1^M, G_2^M, \mathcal{M})$ of $(G_1, G_2)$.

Measure: The number of adjacencies between $G_1^M$ and $G_2^M$.

We define Max-Adj to be the problem Max-$k$-Adj in which $k$ is unbounded.

5.1 A 1.1442-approximation for Max-2-Adj

We focus here on balanced genomes $G_1$ and $G_2$ such that $\mathrm{occ}(G_1) = 2$, and we give an approximation
algorithm for Max-2-Adj based on the Max-2-CSP problem (defined below), for which a
1.1442-approximation algorithm is given in [9]. The main idea is to construct a boolean formula $\varphi$ for
each possible adjacency, and next to maximize the number of boolean formulas $\varphi$ that can be
simultaneously satisfied by a truth assignment; the number of simultaneously satisfied formulas
will be exactly the number of adjacencies, and hence any approximation ratio for Max-2-CSP is
an approximation ratio for Max-2-Adj.

Problem: Max-$k$-CSP

Input: A pair $(\mathcal{X}, \Phi)$, where $\mathcal{X}$ is a set of boolean variables and $\Phi$ is a set of boolean
formulas such that each formula contains at most $k$ literals of $\mathcal{X}$.

Solution: An assignment of $\mathcal{X}$.

Measure: The number of formulas that are satisfied by the assignment.

We define the following transformation MakeCSP that associates to any instance of Max-2-Adj
an instance of Max-2-CSP. Given an instance $(G_1, G_2)$ of Max-2-Adj, we create a variable $X_g$
for each gene $g$ and define $\mathcal{X}$ as the set of variables $X_g$. Then, we construct the set $\Phi$ of formulas. For
each duo $d_i = (G_1[i], G_1[i+1])$, $1 \le i \le n_{G_1} - 1$, such that $d_i$ or $-d_i$ appears in $G_2$, we distinguish
the following cases in order to create a formula $\varphi_i$ of $\Phi$:

1. There exists a unique duo $d_j = (G_2[j], G_2[j+1])$ in $G_2$ such that $d_j = d_i$ or $d_j = -d_i$. For the sake
of readability, we define the literal $Y_p^q$, $1 \le p \le n_{G_1}$, $1 \le q \le n_{G_2}$, where $|G_1[p]| = |G_2[q]|$, as
follows: $Y_p^q = X_{|G_1[p]|}$ if $N_{G_1}[p] = N_{G_2}[q]$ and $Y_p^q = \overline{X_{|G_1[p]|}}$ otherwise. We now consider two
cases:

- (a) $d_i = d_j$: in that case, $\varphi_i = (Y_i^j \wedge Y_{i+1}^{j+1})$.

- (b) $d_i = -d_j$: in that case, $\varphi_i = (Y_i^{j+1} \wedge Y_{i+1}^{j})$.

2. The duo $d_i$ appears twice in $G_2$. We consider two cases:

- (c) $N_{G_1}[i] = N_{G_1}[i+1]$: in that case, $\varphi_i = (X_{|G_1[i]|} \odot X_{|G_1[i+1]|})$ where $\odot$ is the boolean
function XNOR (true iff both variables have the same value).

- (d) $N_{G_1}[i] \ne N_{G_1}[i+1]$: in that case, $\varphi_i = (X_{|G_1[i]|} \oplus X_{|G_1[i+1]|})$ where $\oplus$ is the boolean function XOR.

Note that each formula $\varphi_i$ contains two literals. Hence, $(\mathcal{X}, \Phi)$ is an instance of Max-2-CSP.

Lemma 11. Let $G_1$ and $G_2$ be two balanced genomes such that $\mathrm{occ}(G_1) = 2$. Let $(\mathcal{X}, \Phi)$ be the
instance of Max-2-CSP obtained by MakeCSP$(G_1, G_2)$. For any integer $k$, if there exists a
maximum matching $(G_1^M, G_2^M, \mathcal{M})$ of $(G_1, G_2)$ which induces at least $k$ adjacencies, then there exists
an assignment of the variables of $\mathcal{X}$ such that at least $k$ formulas of $\Phi$ are satisfied.

Proof. Let $G_1$ and $G_2$ be two balanced genomes such that $\mathrm{occ}(G_1) = 2$ and let $(\mathcal{X}, \Phi)$ be the instance
of Max-2-CSP obtained by MakeCSP$(G_1, G_2)$. Let $k$ be an integer.

Suppose there exists a maximum matching $(G_1^M, G_2^M, \mathcal{M})$ of $(G_1, G_2)$ which induces at least $k$
adjacencies. We construct the following assignment of variables of $\mathcal{X}$. For each gene $g$, we define
$X_g = 1$ if $g$ is not duplicated, else we define $X_g = 1$ iff the occurrences of $g$ are matched in
the reading order (see Figure 3). We now show that for each duo which induces an adjacency
between $G_1^M$ and $G_2^M$, there exists a distinct satisfied formula of $\Phi$. Let $d_i = (G_1^M[i], G_1^M[i+1])$,
$1 \le i \le n_{G_1} - 1$, be a duo which induces an adjacency, and let $d_j = (G_2^M[j], G_2^M[j+1])$ be the
related duo on $G_2^M$. By construction of $\Phi$, there exists a formula $\varphi_i \in \Phi$ which has been previously
defined in one of the cases (a), (b), (c) or (d) of the definition of MakeCSP. We claim that, for each
of these cases, $\varphi_i$ is satisfied:

- (a) $\varphi_i = (Y_i^j \wedge Y_{i+1}^{j+1})$ and $d_i = d_j$. We first prove that literal $Y_i^j$ is true. Three cases are possible.
(i) The gene $|G_1[i]|$ is not duplicated; then we have defined in our assignment $X_{|G_1[i]|} = 1$.
Moreover, we have $Y_i^j = X_{|G_1[i]|}$ (since $N_{G_1}[i] = N_{G_2}[j] = 0$), hence $Y_i^j$ is true. (ii) The gene
$|G_1[i]|$ is duplicated and $N_{G_1}[i] = N_{G_2}[j]$; then, by definition of our assignment and since $G_1[i]$
and $G_2[j]$ are matched together in the maximum matching $(G_1^M, G_2^M, \mathcal{M})$, we have $X_{|G_1[i]|} = 1$
(we match signed genes in the reading order). Moreover, we have $Y_i^j = X_{|G_1[i]|}$, which implies
that $Y_i^j$ is true. (iii) The gene $|G_1[i]|$ is duplicated and $N_{G_1}[i] \ne N_{G_2}[j]$; then, by definition
of our assignment and since $G_1[i]$ and $G_2[j]$ are matched together in the maximum matching
$(G_1^M, G_2^M, \mathcal{M})$, we have $X_{|G_1[i]|} = 0$ (we do not match signed genes in the reading order).
Moreover, we have in this case $Y_i^j = \overline{X_{|G_1[i]|}}$, which implies that $Y_i^j$ is true.

In each case, we have proved that $Y_i^j$ is true. We can also prove that $Y_{i+1}^{j+1}$ is true, using the
same arguments. Hence, we conclude that $\varphi_i$ is true.

- (b) $\varphi_i = (Y_i^{j+1} \wedge Y_{i+1}^{j})$ and $d_i = -d_j$. By similar arguments as in case (a), we can prove that
$Y_i^{j+1}$ and $Y_{i+1}^{j}$ are true.

- (c) We have $N_{G_1}[i] = N_{G_1}[i+1]$ and the duo $d_i$ appears twice in $G_2$ (noted $d_j$ and $d_{j'}$).
Since $d_i$ induces an adjacency, the duo $d_i$ matches either $d_j$ or $d_{j'}$. In these two cases, we have
$X_{|G_1[i]|} = X_{|G_1[i+1]|}$ (otherwise $G_1[i]$ and $G_1[i+1]$ would not match successive signed genes).
Moreover, $\varphi_i = \overline{(X_{|G_1[i]|} \oplus X_{|G_1[i+1]|})}$ and thus $\varphi_i$ is true.

- (d) We have $N_{G_1}[i] \ne N_{G_1}[i+1]$ and the duo $d_i$ appears twice in $G_2$ (noted $d_j$ and $d_{j'}$).
Since $d_i$ induces an adjacency, the duo $d_i$ matches either $d_j$ or $d_{j'}$. In these two cases, we have
$X_{|G_1[i]|} \ne X_{|G_1[i+1]|}$ (otherwise $G_1[i]$ and $G_1[i+1]$ would not match successive signed genes).
Moreover, $\varphi_i = (X_{|G_1[i]|} \oplus X_{|G_1[i+1]|})$ and thus $\varphi_i$ is true.

We have constructed a variable assignment of $\mathcal{X}$ such that, for each duo $d_i$ in $G_1^{\mathcal{M}}$ which induces
an adjacency, there exists a distinct satisfied formula $\varphi_i \in \Phi$. Thus, if there exists a maximum
matching of $(G_1, G_2)$ which induces at least $k$ adjacencies, then the corresponding assignment
implies at least $k$ satisfied formulas. □

Lemma 12. Let $G_1$ and $G_2$ be two balanced genomes such that $occ(G_1) = 2$. Let $(\mathcal{X}, \Phi)$ be the
instance of Max-2-CSP obtained by MakeCSP($G_1$, $G_2$). For any integer $k$, if there exists an assignment
of $\mathcal{X}$ such that at least $k$ formulas of $\Phi$ are satisfied, then there exists a maximum matching
$(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$ which induces at least $k$ adjacencies.

Fig. 3. All possibilities of assignment: $X_A = 1$ (gene A occurs twice and signed genes are matched in the reading
order), $X_B = 1$ or $X_B = 0$ (gene B occurs once) and $X_C = 0$ (gene C occurs twice and signed genes are not matched
in the reading order). Note that this construction is independent of the sign of the genes.

Proof. Let $G_1$ and $G_2$ be two balanced genomes such that $occ(G_1) = 2$ and let $(\mathcal{X}, \Phi)$ be the instance
of Max-2-CSP obtained by MakeCSP($G_1$, $G_2$). Let $k$ be an integer.

Suppose there exists an assignment of $\mathcal{X}$ such that at least $k$ formulas $\varphi_i \in \Phi$ are satisfied. We
create the following maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$. For each variable $X_g$ such that
the gene $g$ is duplicated, we match the occurrences of $g$ in the reading order if $X_g = 1$ (such as gene
A in Figure 3). If we have $X_g = 0$, we match the first occurrence of $g$ on $G_1$ with the second one
on $G_2$ and the second occurrence of $g$ on $G_1$ with the first one on $G_2$ (such as gene C in Figure 3).
Then, we match signed genes which are not duplicated. Now, we prove that each satisfied formula
$\varphi_i \in \Phi$ induces a distinct adjacency for $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$. Let $\varphi_i \in \Phi$ be a satisfied formula which is
defined in one of the cases (a), (b), (c) or (d) of the definition of MakeCSP:

- (a) We have $\varphi_i = (Y_i^j \wedge Y_{i+1}^{j+1})$ and the duos $d_i = (G_1[i], G_1[i+1])$ and $d_j = (G_2[j], G_2[j+1])$
are identical.

Here, we must prove that $d_i$ and $d_j$ are matched together in $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ and thus induce an
adjacency. First, we show that signed genes $G_1[i]$ and $G_2[j]$ are matched together in $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$.
Since $\varphi_i$ is satisfied, we have $Y_i^j = 1$. We must distinguish three cases.
(i) The gene $|G_1[i]|$ is not duplicated: in that case, the signed gene $G_1[i]$ can be matched only with $G_2[j]$.
(ii) The gene $|G_1[i]|$ is duplicated and we have $N_{G_1}[i] = N_{G_2}[j]$. In that case, we have defined $Y_i^j = X_{|G_1[i]|}$,
which implies $X_{|G_1[i]|} = 1$. Thus, since $N_{G_1}[i] = N_{G_2}[j]$, the signed genes $G_1[i]$ and $G_2[j]$ are
matched together.
(iii) The gene $|G_1[i]|$ is duplicated and we have $N_{G_1}[i] \ne N_{G_2}[j]$. In that
case, we have defined $Y_i^j = \overline{X_{|G_1[i]|}}$, which implies $X_{|G_1[i]|} = 0$. Thus, since $N_{G_1}[i] \ne N_{G_2}[j]$, the
signed genes $G_1[i]$ and $G_2[j]$ are matched together.
In each case, the signed genes $G_1[i]$ and $G_2[j]$ are matched together. We can conclude in the same way
that $G_1[i+1]$ and $G_2[j+1]$ are also matched together, which implies that $d_i$ induces an adjacency.

- (b) We have $\varphi_i = (Y_i^{j+1} \wedge Y_{i+1}^{j}) = 1$ and the duos $d_i = (G_1[i], G_1[i+1])$ and $d_j = (G_2[j], G_2[j+1])$
are reversed.

We can use the same reasoning as in case (a) to prove that $d_i$ induces an adjacency.

- (c) The duo $d_i$ appears twice in $G_2$ (noted $d_j$ and $d_{j'}$). We have $\varphi_i = \overline{(X_{|G_1[i]|} \oplus X_{|G_1[i+1]|})}$ and
$N_{G_1}[i] = N_{G_1}[i+1]$.

Since $\varphi_i$ is true, we have $X_{|G_1[i]|} = X_{|G_1[i+1]|}$, which implies by construction of the maximum
matching that $d_i$ matches $d_j$ or $d_{j'}$.

- (d) The duo $d_i$ appears twice in $G_2$ (noted $d_j$ and $d_{j'}$). We have $\varphi_i = (X_{|G_1[i]|} \oplus X_{|G_1[i+1]|})$ and
$N_{G_1}[i] \ne N_{G_1}[i+1]$. Since $\varphi_i$ is true, we have $X_{|G_1[i]|} \ne X_{|G_1[i+1]|}$, which implies by construction
of the maximum matching that $d_i$ matches $d_j$ or $d_{j'}$.

Consequently, for each satisfied formula, there exists a distinct adjacency between $G_1^{\mathcal{M}}$ and $G_2^{\mathcal{M}}$.
Thus, if there exists an assignment of $\mathcal{X}$ which implies at least $k$ satisfied formulas of $\Phi$, then there
exists a maximum matching of $(G_1, G_2)$ which implies at least $k$ adjacencies. □
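
The construction used in this proof can be illustrated with a minimal Python sketch. It assumes that genomes are given as lists of signed integers, that each gene occurs at most twice, and that an adjacency is a pair of consecutive positions of $G_1$ matched to consecutive positions of $G_2$, either in the same order with the same signs or in reverse order with opposite signs. The function names and the dictionary `x` (mapping a gene to its 0/1 value) are illustrative choices, not notation from the paper.

```python
from typing import Dict, List


def matching_from_assignment(g1: List[int], g2: List[int],
                             x: Dict[int, int]) -> Dict[int, int]:
    """Build a maximum matching (G1 position -> G2 position) from an assignment.

    Assumes balanced genomes in which every gene occurs at most twice.
    """
    def positions(genome: List[int]) -> Dict[int, List[int]]:
        pos: Dict[int, List[int]] = {}
        for idx, gene in enumerate(genome):
            pos.setdefault(abs(gene), []).append(idx)
        return pos

    p1, p2 = positions(g1), positions(g2)
    match: Dict[int, int] = {}
    for gene, occ1 in p1.items():
        occ2 = p2[gene]
        if len(occ1) == 2 and x.get(gene, 1) == 0:
            occ2 = occ2[::-1]  # X_g = 0: do not match in the reading order
        for a, b in zip(occ1, occ2):
            match[a] = b
    return match


def count_adjacencies(g1: List[int], g2: List[int], match: Dict[int, int]) -> int:
    """Count consecutive positions of g1 matched to consecutive positions of g2."""
    count = 0
    for i in range(len(g1) - 1):
        j, j2 = match[i], match[i + 1]
        direct = j2 == j + 1 and g1[i] == g2[j] and g1[i + 1] == g2[j2]
        reverse = j2 == j - 1 and g1[i] == -g2[j] and g1[i + 1] == -g2[j2]
        if direct or reverse:
            count += 1
    return count


g1 = [3, 1, 2, 3, 4, 2, 5]
g2 = [3, 4, 2, 3, 1, 2, 5]
crossed = matching_from_assignment(g1, g2, {3: 0, 2: 0})
in_order = matching_from_assignment(g1, g2, {3: 1, 2: 1})
print(count_adjacencies(g1, g2, crossed), count_adjacencies(g1, g2, in_order))
```

On this pair of genomes, setting both variables to 0 yields four adjacencies, whereas matching both duplicated genes in the reading order yields only two.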

Lemmas 11 and 12 prove that any $\alpha$-approximation for Max-2-CSP implies an $\alpha$-approximation
for Max-2-Adj. In [9], an approximation algorithm is given for Max-2-CSP, whose approximation
ratio is equal to $\frac{1}{0.874} \le 1.1442$. Thus, we have the following theorem.

Theorem 7. Max-2-Adj is 1.1442-approximable.

5.2 A 3-approximation for Max-3-Adj

Now, we present a 3-approximation for Max-3-Adj by using the Maximum Independent Set 
problem defined as follows: 

Problem: Max-Independent-Set 
Input: A graph G = (V, E). 

Solution: An independent set of G (i.e. a subset V' of V such that no two vertices in V'
are joined by an edge in E).
Measure: The cardinality of V'.



In [17], Goldstein et al. used Max-Independent-Set to approximate the Minimum Common
String Partition problem by creating a conflict graph. We construct in the same way an instance
of Max-Independent-Set where a vertex represents a possible adjacency and where an edge
represents a conflict between two adjacencies. We define MakeMIS to be the following transformation,
which associates to two balanced genomes $G_1$ and $G_2$ an instance of Max-Independent-Set. We
construct a vertex for each duo match, and then we create an edge between two vertices when they
are in conflict, i.e. when the two matches are incompatible. Figure 4 illustrates the graph obtained by
MakeMIS($G_1$, $G_2$) where $G_1 = +3\ +1\ +2\ +3\ +4\ +2\ +5$ and $G_2 = +3\ +4\ +2\ +3\ +1\ +2\ +5$.



Fig. 4. The conflict graph obtained by MakeMIS($G_1$, $G_2$) where $G_1 = +3\ +1\ +2\ +3\ +4\ +2\ +5$ and
$G_2 = +3\ +4\ +2\ +3\ +1\ +2\ +5$ (for the sake of readability, positive signs are not displayed).
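
The conflict graph of this example can be rebuilt with the following Python sketch. It rests on one assumption that the text leaves informal: two duo matches are taken to be incompatible when the union of the position pairs they impose is not a one-to-one partial map between $G_1$ and $G_2$. Genomes are lists of signed integers with 0-based positions, and the names `duo_matches`, `in_conflict` and `make_mis` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def duo_matches(g1: List[int], g2: List[int]) -> List[Dict[int, int]]:
    """One vertex per duo match, stored as a map G1 position -> G2 position."""
    vertices = []
    for i in range(len(g1) - 1):
        for j in range(len(g2) - 1):
            if (g1[i], g1[i + 1]) == (g2[j], g2[j + 1]):      # identical duos
                vertices.append({i: j, i + 1: j + 1})
            if (g1[i], g1[i + 1]) == (-g2[j + 1], -g2[j]):    # reversed duos
                vertices.append({i: j + 1, i + 1: j})
    return vertices


def in_conflict(m1: Dict[int, int], m2: Dict[int, int]) -> bool:
    """True when the two matches cannot belong to the same matching."""
    merged = dict(m1)
    for p, q in m2.items():
        if merged.get(p, q) != q:
            return True   # one G1 position sent to two G2 positions
        merged[p] = q
    # two distinct G1 positions sent to the same G2 position
    return len(set(merged.values())) != len(merged)


def make_mis(g1: List[int], g2: List[int]) -> Tuple[List[Dict[int, int]], Set[Tuple[int, int]]]:
    vertices = duo_matches(g1, g2)
    edges = {(u, v) for u, v in combinations(range(len(vertices)), 2)
             if in_conflict(vertices[u], vertices[v])}
    return vertices, edges


g1 = [3, 1, 2, 3, 4, 2, 5]
g2 = [3, 4, 2, 3, 1, 2, 5]
vertices, edges = make_mis(g1, g2)
print(len(vertices), sorted(edges))
```

Under this reading of "conflict", the example yields six vertices and six edges, and the vertices numbered 0, 1, 3 and 4 form an independent set, i.e. four pairwise compatible adjacencies.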

In order to prove that there exists a 3-approximation for Max-3-Adj, we give the following 
intermediate lemmas. 



Lemma 13. Let $G_1$ and $G_2$ be two balanced genomes and let $G$ be the graph obtained by
MakeMIS($G_1$, $G_2$). For any integer $k$, there exists an independent set $V'$ of $G$ such that $|V'| \ge k$ iff
there exists a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$ which induces at least $k$ adjacencies.

Proof. Let $G_1$ and $G_2$ be two balanced genomes and let $G$ be the graph obtained by
MakeMIS($G_1$, $G_2$). Let $k$ be an integer.

($\Rightarrow$) Suppose there exists an independent set $V'$ of $G$ such that $|V'| \ge k$. We construct a
matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$ as follows: first, for each vertex of $V'$, we match together the
two corresponding duos, thus inducing one adjacency (called a definite adjacency). By construction
of $G$, this operation is possible. Indeed, two vertices which are not connected in $G$ imply two
compatible adjacencies. Then, we match arbitrarily the unmatched genes. This operation cannot
break any definite adjacency. Finally, we obtain a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ which induces
at least $|V'|$ adjacencies, and consequently at least $k$ adjacencies.

($\Leftarrow$) Suppose there exists a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$ which induces at
least $k$ adjacencies. We construct a set $V'$ by taking each vertex which represents a duo match
between $G_1^{\mathcal{M}}$ and $G_2^{\mathcal{M}}$. By construction of $G$, $V'$ is an independent set (no pair of adjacencies can
create a conflict), and thus we have $|V'| \ge k$. □

Lemma 14. Let $G_1$ and $G_2$ be two balanced genomes such that $occ(G_1) = k$. The maximum degree
$\Delta$ of the graph $G$ obtained by MakeMIS($G_1$, $G_2$) satisfies $\Delta \le 6(k-1)$.

Proof. Let $G_1$ and $G_2$ be two balanced genomes such that $occ(G_1) = k$ and let $G$ be the graph
obtained by MakeMIS($G_1$, $G_2$). Consider a duo match $m = (d_1, d_2)$ with $d_1 = (G_1[i], G_1[i+1])$ and
$d_2 = (G_2[j], G_2[j+1])$, where $1 \le i \le n_{G_1} - 1$ and $1 \le j \le n_{G_2} - 1$. We claim that the vertex $v_m$ of
$G$, which represents the duo match $m$, is connected to at most $6(k-1)$ vertices. For this, we list the
possible duo matches $m' = (d'_1, d'_2)$ such that the vertex $v_{m'}$ of $G$ which represents $m'$ is connected
to $v_m$. Remark that if $v_{m'}$ is connected to $v_m$ (i.e. $m$ and $m'$ are in conflict), then at least one of
the duos $d'_1$ and $d'_2$ overlaps $d_1$ or $d_2$, respectively. Let $d'_1$ be a duo in $G_1$ which overlaps $d_1$.
First, we list the possible duos $d'_2$ such that the duo matches $m = (d_1, d_2)$ and $m' = (d'_1, d'_2)$ are in
conflict. Remark that $d'_1$ (or $-d'_1$) appears at most $k$ times on $G_2$ since a gene can occur at most $k$
times. We then distinguish three cases:

- (a) $d'_1 = (G_1[i-1], G_1[i])$: if $d'_1$ (or $-d'_1$) appears $k$ times on $G_2$, one of these occurrences is
necessarily $d'_2 = (G_2[j-1], G_2[j])$ if $d_1 = d_2$, or $d'_2 = (G_2[j+1], G_2[j+2])$ if $d_1 = -d_2$. For these
two cases, the duo matches $m$ and $(d'_1, d'_2)$ are not in conflict.

- (b) $d'_1 = d_1$: if $d'_1$ (or $-d'_1$) appears $k$ times on $G_2$, one of these occurrences is necessarily $d_2$,
which induces in this case no conflict with $m$.

- (c) $d'_1 = (G_1[i+1], G_1[i+2])$: if $d'_1$ (or $-d'_1$) appears $k$ times on $G_2$, one of these occurrences is
necessarily $d'_2 = (G_2[j+1], G_2[j+2])$ if $d_1 = d_2$, or $d'_2 = (G_2[j-1], G_2[j])$ if $d_1 = -d_2$. For these
two cases, the duo matches $m$ and $m'$ are not in conflict.

For each case, one of the $k$ possible duos $d'_2$ does not imply a conflict between $m$ and $m'$. Thus,
for any duo $d'_1$ which overlaps $d_1$, there exist at most $k-1$ duos $d'_2$ on $G_2$ such that $m$ and $m'$ are
in conflict. Using the same arguments, we can easily prove that for any duo $d'_2$ which overlaps $d_2$,
there exist at most $k-1$ duos $d'_1$ on $G_1$ such that $m$ and $m'$ are in conflict. Hence, each of the
six duos which overlap $d_1$ or $d_2$ implies at most $k-1$ conflicts. Thus, we obtain at most $6(k-1)$
vertices which are connected to the vertex $v_m$ in the conflict graph. □

According to Lemma 13, any $\alpha$-approximation for Max-Independent-Set is thus also an
$\alpha$-approximation for Max-$k$-Adj. It is proved in [5] that Max-Independent-Set is approximable
within ratio $\frac{\Delta+3}{5}$, where $\Delta$ is the maximum degree of the graph. Combining this with
Lemma 14, we obtain the following result.

Theorem 8. Max-$k$-Adj is $\frac{6k-3}{5}$-approximable.
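
For reference, the ratio of Theorem 8 follows by substituting the degree bound of Lemma 14 into the ratio of [5]; the evaluations at small $k$ below are plain arithmetic added for illustration:

```latex
\[
  \frac{\Delta + 3}{5} \;\le\; \frac{6(k-1) + 3}{5} \;=\; \frac{6k - 3}{5},
  \qquad
  \left.\frac{6k-3}{5}\right|_{k=2} = \frac{9}{5} = 1.8,
  \quad
  \left.\frac{6k-3}{5}\right|_{k=3} = \frac{15}{5} = 3,
  \quad
  \left.\frac{6k-3}{5}\right|_{k=4} = \frac{21}{5} = 4.2 .
\]
```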

Note that in the case where k = 2, we obtain a ratio of 1.8, which is not better than the one 
obtained in Theorem 7. Moreover, we introduce in the next section a 4-approximation in the general 
case. Hence, the only interesting case of Theorem 8 above is when k = 3, inducing a 3-approximation 
for Max-3-Adj. 

5.3 A 4-approximation for Max-Adj 

In [14], a 4-approximation algorithm for the Max-Weighted 2-interval Pattern problem
(Max-W2IP) is given. In the following, we first define Max-W2IP, and next we present how
we can relate any instance of Max-Adj to an instance of Max-W2IP.

The Maximum Weighted 2-Interval Pattern problem. A 2-interval is the union of two disjoint
intervals defined over a single line. For a 2-interval $D = (I, J)$, we always assume that
$I < J$, i.e., $I$ is completely on the left of $J$ and does not overlap $J$. We say that two 2-intervals
$D_1 = (I_1, J_1)$ and $D_2 = (I_2, J_2)$ are disjoint if $D_1$ and $D_2$ have no common point (i.e.
$(I_1 \cup J_1) \cap (I_2 \cup J_2) = \emptyset$). Three possible relations exist between two disjoint 2-intervals: we write
(1) $D_1 \prec D_2$ if $I_1 < J_1 < I_2 < J_2$, (2) $D_1 \sqsubset D_2$ if $I_2 < I_1 < J_1 < J_2$, and (3) $D_1 \between D_2$ if $I_1 < I_2 < J_1 < J_2$.
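
The three relations can be checked mechanically. The following Python sketch assumes that each interval is a pair of integer endpoints `(start, end)` with closed ends, and uses the illustrative labels `"precedes"`, `"nested"` and `"crosses"` for relations (1), (2) and (3); neither the representation nor the labels come from the paper.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]                 # closed interval [start, end]
TwoInterval = Tuple[Interval, Interval]    # (I, J) with I entirely left of J


def before(a: Interval, b: Interval) -> bool:
    """a < b: a lies completely to the left of b, without overlap."""
    return a[1] < b[0]


def relation(d1: TwoInterval, d2: TwoInterval) -> Optional[str]:
    """Relation of d1 to d2, in this order, or None if none of the three holds."""
    (i1, j1), (i2, j2) = d1, d2
    if before(i1, j1) and before(j1, i2) and before(i2, j2):
        return "precedes"    # I1 < J1 < I2 < J2
    if before(i2, i1) and before(i1, j1) and before(j1, j2):
        return "nested"      # I2 < I1 < J1 < J2
    if before(i1, i2) and before(i2, j1) and before(j1, j2):
        return "crosses"     # I1 < I2 < J1 < J2
    return None


print(relation(((0, 1), (2, 3)), ((4, 5), (6, 7))))   # precedes
print(relation(((2, 3), (4, 5)), ((0, 1), (6, 7))))   # nested
print(relation(((0, 1), (4, 5)), ((2, 3), (6, 7))))   # crosses
```

Since each relation is ordered, testing whether a pair is comparable amounts to calling `relation` on both orders of the pair.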

We say that a pair of 2-intervals $D_1$ and $D_2$ is $R$-comparable for some $R \in \{\prec, \sqsubset, \between\}$ if either
$(D_1, D_2) \in R$ or $(D_2, D_1) \in R$. A set of 2-intervals $\mathcal{D}$ is $\mathcal{R}$-comparable for some $\mathcal{R} \subseteq \{\prec, \sqsubset, \between\}$,
$\mathcal{R} \ne \emptyset$, if any pair of distinct 2-intervals in $\mathcal{D}$ is $R$-comparable for some $R \in \mathcal{R}$. The non-empty
set $\mathcal{R}$ is called a model. The Max-Weighted 2-interval Pattern (Max-W2IP) problem is
formally defined as follows.

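Problem: Max-Weighted 2-interval Pattern (Max-W2IP)
Input: A set $\mathcal{D}$ of 2-intervals, a model $\mathcal{R} \subseteq \{\prec, \sqsubset, \between\}$ with $\mathcal{R} \ne \emptyset$, and a weight function
$\omega : \mathcal{D} \to \mathbb{N}^+$.
Solution: An $\mathcal{R}$-comparable subset $\mathcal{D}'$ of $\mathcal{D}$.
Measure: The weight of $\mathcal{D}'$.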



Transformation. We first describe how to transform any instance $(G_1, G_2)$ of Max-Adj into an
instance, referred to hereafter as Make2I($G_1$, $G_2$) = $(\mathcal{D}, \mathcal{R}, \omega)$, of Max-W2IP. We need a new
definition. Let $G_1$ and $G_2$ be two balanced genomes. An interval $I_1$ of $G_1$ and an interval $I_2$ of $G_2$, both
of size at least 2, are said to be identical if they correspond to the same string up to a complete
reversal, where a reversal also changes all the signs in the string. Clearly, two identical intervals
have the same length.

The weighted 2-interval set $\mathcal{D}$ is obtained as follows. We first concatenate $G_1$ and $G_2$, and
for any pair $(I_1, I_2)$ of identical intervals ($I_1$ is an interval of $G_1$ and $I_2$ is an interval of $G_2$), we
construct the 2-interval $D = (I_1, I_2)$ of weight $\omega(D) = |I_1| - 1$ ($= |I_2| - 1$) and add it to $\mathcal{D}$. Notice
that, since identical intervals have length at least 2, each 2-interval of $\mathcal{D}$ has weight at least 1.
Figure 5 gives an example of such a construction. Observe that, by construction, no two 2-intervals
of $\mathcal{D}$ are $\{\prec\}$-comparable. The construction of the instance of Max-W2IP is completed by setting
$\mathcal{R} = \{\prec, \sqsubset, \between\}$, i.e., we are looking for disjoint 2-intervals, no matter what the relation between
any two disjoint 2-intervals is. Therefore, for the sake of abbreviation, we shall denote the corresponding
instance simply as Make2I($G_1$, $G_2$) = $(\mathcal{D}, \omega)$ and forget about the model.



Fig. 5. 2-intervals induced by genomes $G_1 = +1\ +2\ -3\ +2\ +1$ and $G_2 = +2\ +1\ +3\ -2\ -1$. For readability,
singleton intervals are not drawn. The dotted 2-interval is of weight 2, while all other 2-intervals are of weight 1.
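
The transformation Make2I can be sketched in a few lines of Python on the genomes of Figure 5. The sketch assumes genomes are lists of signed integers with 0-based positions, and shifts the positions of $G_2$ by the length of $G_1$ to mimic the concatenation; the names `make_2i` and `reversed_negated` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def reversed_negated(s: Tuple[int, ...]) -> Tuple[int, ...]:
    """Complete reversal of a string of signed genes (order and signs flipped)."""
    return tuple(-g for g in reversed(s))


def make_2i(g1: List[int], g2: List[int]) -> List[Tuple[Interval, Interval, int]]:
    """List of (I1, I2, weight) for every pair of identical intervals of size >= 2.

    I1 = [a, b] is an interval of g1; I2 is an interval of g2 whose positions
    are shifted by len(g1), as in the concatenation of g1 and g2.
    """
    n1 = len(g1)
    result = []
    for length in range(2, min(len(g1), len(g2)) + 1):
        for a in range(len(g1) - length + 1):
            s1 = tuple(g1[a:a + length])
            for c in range(len(g2) - length + 1):
                s2 = tuple(g2[c:c + length])
                if s1 == s2 or s1 == reversed_negated(s2):
                    result.append(((a, a + length - 1),
                                   (n1 + c, n1 + c + length - 1),
                                   length - 1))
    return result


two_intervals = make_2i([1, 2, -3, 2, 1], [2, 1, 3, -2, -1])
for d in two_intervals:
    print(d)
```

On this input the sketch produces four 2-intervals, three of weight 1 and one of weight 2, which agrees with the caption of Figure 5.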

We now describe how to transform any solution of Max-W2IP into a solution of Max-Adj.
Let $G_1$ and $G_2$ be two balanced genomes and Make2I($G_1$, $G_2$) = $(\mathcal{D}, \omega)$. Furthermore, let $\mathcal{S} \subseteq \mathcal{D}$ be
a set of disjoint 2-intervals, i.e. a solution of Max-W2IP for the model $\{\prec, \sqsubset, \between\}$ and the instance
$(\mathcal{D}, \omega)$.

We write Max-W2IP_to_Adj($\mathcal{S}$) for the transformation of $\mathcal{S}$ into a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$
of $(G_1, G_2)$ defined as follows. First, for each 2-interval $D = (I_1, I_2)$ of $\mathcal{S}$, we match the signed genes
of $I_1$ and $I_2$ in the natural way; then, in order to achieve a maximum matching (since each signed
gene is not necessarily covered by a 2-interval in $\mathcal{S}$), we apply the following greedy algorithm:
iteratively, we match, arbitrarily, two unmatched signed genes $g_1$ and $g_2$ such that $|g_1| = |g_2|$ and
$g_i$ is a gene of $G_i$ ($i = 1, 2$), until no such pair of signed genes exists. After a relabeling of signed
genes according to this matching (denoted $\mathcal{M}$), we obtain a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of
$(G_1, G_2)$.

The rationale of this construction stems from the following two lemmas.

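Lemma 15. Let $G_1$ and $G_2$ be two balanced genomes, Make2I($G_1$, $G_2$) = $(\mathcal{D}, \omega)$ and $\mathcal{S}$ be any set
of disjoint 2-intervals of $\mathcal{D}$. If we denote by $W_{\mathcal{S}}$ the total weight of $\mathcal{S}$, then the maximum matching
$(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$ obtained by Max-W2IP_to_Adj($\mathcal{S}$) induces at least $W_{\mathcal{S}}$ adjacencies.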

Proof. For each 2-interval $D = (I_1, I_2)$ of $\mathcal{S}$, we have matched the signed genes of $I_1$ and $I_2$
in the natural way. Therefore, for each 2-interval $D = (I_1, I_2)$ of $\mathcal{S}$, we obtain $|I_1| - 1$
adjacencies in $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ since $I_1$ and $I_2$ are identical intervals. Since the final greedy part of
Max-W2IP_to_Adj($\mathcal{S}$) does not delete any adjacency, we have at least $W_{\mathcal{S}}$ adjacencies in $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$.
□
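
A minimal Python sketch of the transformation Max-W2IP_to_Adj illustrates the lemma on the genomes of Figure 5. It assumes each selected 2-interval is given as a triple `(a, c, length)`, meaning that `g1[a:a+length]` and `g2[c:c+length]` are identical intervals (0-based, positions of $G_2$ not shifted), and that the input genomes are balanced. The function names and this encoding are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def w2ip_to_adj(g1: List[int], g2: List[int],
                selected: List[Tuple[int, int, int]]) -> Dict[int, int]:
    """Turn disjoint 2-intervals into a maximum matching (G1 pos -> G2 pos)."""
    match: Dict[int, int] = {}
    for a, c, length in selected:
        direct = g1[a:a + length] == g2[c:c + length]
        for t in range(length):
            # natural way: left to right if the strings are equal,
            # right to left if they are equal up to a complete reversal
            match[a + t] = c + t if direct else c + length - 1 - t
    used = set(match.values())
    # greedy completion: pair remaining occurrences of the same gene arbitrarily
    for i, gene in enumerate(g1):
        if i in match:
            continue
        for j, other in enumerate(g2):
            if j not in used and abs(other) == abs(gene):
                match[i] = j
                used.add(j)
                break
    return match


def count_adjacencies(g1: List[int], g2: List[int], match: Dict[int, int]) -> int:
    count = 0
    for i in range(len(g1) - 1):
        j, j2 = match[i], match[i + 1]
        direct = j2 == j + 1 and g1[i] == g2[j] and g1[i + 1] == g2[j2]
        reverse = j2 == j - 1 and g1[i] == -g2[j] and g1[i + 1] == -g2[j2]
        count += direct or reverse
    return count


g1 = [1, 2, -3, 2, 1]
g2 = [2, 1, 3, -2, -1]
selected = [(0, 2, 3)]                    # the weight-2 2-interval of Figure 5
match = w2ip_to_adj(g1, g2, selected)
print(match, count_adjacencies(g1, g2, match))
```

Here the selected set has total weight 2 and the resulting matching induces 3 adjacencies: the greedy completion happens to create one extra adjacency, which is why the lemma only states a lower bound.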

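Lemma 16. Let $G_1$ and $G_2$ be two balanced genomes, $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ be a maximum matching of
$(G_1, G_2)$, Make2I($G_1$, $G_2$) = $(\mathcal{D}, \omega)$ and $W$ be the number of adjacencies induced by $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$.
Then there exists a subset $\mathcal{S} \subseteq \mathcal{D}$ of disjoint 2-intervals of total weight $W$.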

Proof. Denote by $n$ the size of $G_1^{\mathcal{M}}$. Consider any factorization $G_1^{\mathcal{M}} = s_1 s_2 \cdots s_p$ such that, for each
$1 \le i < p$, $s_i$ and $s_{i+1}$ are separated by one breakpoint and no breakpoint appears in $s_i$, $1 \le i \le p$.
Therefore, there exist $p - 1$ breakpoints between $G_1^{\mathcal{M}}$ and $G_2^{\mathcal{M}}$, and hence $n - p$ adjacencies between
$G_1^{\mathcal{M}}$ and $G_2^{\mathcal{M}}$. To each substring $s_i$ of the factorization of $G_1^{\mathcal{M}}$ corresponds a substring $t_i$ in $G_2^{\mathcal{M}}$
such that $s_i$ and $t_i$ are identical. Moreover, each substring $s_i$ of size $l_i$, $1 \le i \le p$, contains $l_i - 1$
adjacencies. We construct the 2-interval set $\mathcal{S}$ as the union of $D_i = (\tilde{s}_i, \tilde{t}_i)$, $1 \le i \le p$, where $\tilde{s}_i$ (resp.
$\tilde{t}_i$) is the interval obtained from $s_i$ (resp. $t_i$). The factorization of $G_1^{\mathcal{M}}$ implies that the constructed
2-intervals are disjoint, and hence the total weight of $\mathcal{S}$ is
$\sum_{i=1}^{p} (l_i - 1) = \sum_{i=1}^{p} l_i - \sum_{i=1}^{p} 1 = n - p = W$. □
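
The factorization used in this proof can be computed directly from a matching. The Python sketch below assumes the matching is a dictionary from 0-based positions of $G_1$ to positions of $G_2$, and uses the same notion of adjacency as above (consecutive positions matched to consecutive positions, either directly or reversed with opposite signs); the name `blocks` is hypothetical.

```python
from typing import Dict, List, Tuple


def blocks(g1: List[int], g2: List[int], match: Dict[int, int]) -> List[Tuple[int, int]]:
    """Split g1 into maximal runs s_1 .. s_p (as position ranges) with no breakpoint."""
    runs, start = [], 0
    for i in range(len(g1) - 1):
        j, j2 = match[i], match[i + 1]
        direct = j2 == j + 1 and g1[i] == g2[j] and g1[i + 1] == g2[j2]
        reverse = j2 == j - 1 and g1[i] == -g2[j] and g1[i + 1] == -g2[j2]
        if not (direct or reverse):      # breakpoint between i and i + 1
            runs.append((start, i))
            start = i + 1
    runs.append((start, len(g1) - 1))
    return runs


g1 = [1, 2, -3, 2, 1]
g2 = [2, 1, 3, -2, -1]
match = {0: 4, 1: 3, 2: 2, 3: 0, 4: 1}
runs = blocks(g1, g2, match)
n, p = len(g1), len(runs)
total_weight = sum(end - start for start, end in runs)   # sum of (l_i - 1)
print(runs, n - p, total_weight)
```

For this matching the genome splits into $p = 2$ blocks, and both $n - p$ and the sum of the block weights equal 3.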

We now describe Algorithm ApproxAdj and then prove it to be a 4-approximation algorithm for 
Max-Adj. 



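Algorithm 1 ApproxAdj

Require: Two balanced genomes $G_1$ and $G_2$.
Ensure: A maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of $(G_1, G_2)$.

- Let Make2I($G_1$, $G_2$) = $(\mathcal{D}, \omega)$.
- Invoke the 4-approximation algorithm of Crochemore et al. [14] to obtain a set of disjoint 2-intervals $\mathcal{S} \subseteq \mathcal{D}$.
- Construct the maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ = Max-W2IP_to_Adj($\mathcal{S}$).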



Theorem 9. Algorithm ApproxAdj is a 4-approximation algorithm for Max-Adj.

Proof. According to Lemmas 15 and 16, there exists a maximum matching $(G_1^{\mathcal{M}}, G_2^{\mathcal{M}}, \mathcal{M})$ of
$(G_1, G_2)$ that induces $W$ adjacencies iff there exists a subset of disjoint 2-intervals $\mathcal{S} \subseteq \mathcal{D}$ with
total weight $W$. Therefore, any approximation ratio for Max-W2IP implies the same approximation
ratio for Max-Adj. In [14], a 4-approximation algorithm is proposed for Max-W2IP. Hence,
Algorithm ApproxAdj is a 4-approximation algorithm for Max-Adj. □

6 Conclusions and future work 

In this paper, we have first given new approximation complexity results for several optimization 
problems in genomic rearrangement. We focused on conserved intervals, common intervals and 
breakpoints, and we took into account the presence of duplicates. We restricted our proofs to cases 
where one genome contains no duplicates and the other contains no more than two occurrences 
of each gene. With this assumption, we proved that the problem of computing an
exemplarization (resp. an intermediate matching, a maximum matching) optimizing any of the three
above-mentioned measures is APX-hard, thus extending the results of [7, 10, 13]. In a second part
of the paper, we have focused on the ZEBD (resp. ZIBD, ZMBD) problems, where the question is 
whether there exists an exemplarization (resp. intermediate matching, maximum matching) that in- 
duces zero breakpoint. We have extended a result from [13] by showing that ZEBD is NP-complete 
even for instances of type (2, k), where k is unbounded. We also have noted that ZEBD and ZIBD 
are equivalent problems, and shown that ZMBD is in P. Finally, we gave several approximation 
algorithms for computing the maximum number of adjacencies of two balanced genomes under the 
maximum matching model. The approximation ratios we get are 1.1442 for instances of type (2, 2), 
3 for instances of type (3, 3) and 4 in the general case. Concerning the latter result, we note that 
the approximation ratio we obtain is constant, even when the number of occurrences in genomes is 
unbounded. 

However, several problems remain unsolved. In particular, concerning approximation algorithms, 
virtually nothing is known (i) in the case of unbalanced genomes and (ii) in the exemplar and 
intermediate models. Indeed, all the existing results (see for instance [17, 19] for the number of
breakpoints), including ours, focus on the maximum matching problem for balanced genomes,
which implies that no gene is deleted from genomes $G_1$ and $G_2$. Now, if we allow genes to be
deleted, the problem seems much more difficult to tackle.

Finally, we would like to recall the following open problem from [11]: what is the complexity of
ZEBD for instances of type (2, 2)?